Part 3: Tidymodels

On this page we are going to take a different approach to analysing the agroforestry data from ERA. We are going apply the tidymodels machine learning framework on the ‘raw’ agroforestry subset dataset from ERA. This process will therefore start, like ‘Part 2: Agroforestry ERAAnalyze,’ with a preliminary explorertive data analysis (EDA) in order to understand what necessary data wrangling cleaninig and pre-processing steps are needed prior to running the machine lerning algorithms. In the EDA we will make use of data visualisations to undestand our agroforestry data better. For the actual machine learning analysis we will follow the predifined steps of the tidymodels workflow. Press the “Show code” to view the R codes used to perform a given analysis or visualise a analysis output.

Here we are going to explore Machine Learning with Tidymodels on the ERA agroforestry data

Loading necessary R packages and ERA data

Loading general R packages

This part of the document is where we actually get to the nitty-gritty of the ERA agroforestry data and therefore it requires us to load a number of R packages for both general Explorortive Data Analysis and Machine Learning.

Show code
# Using the pacman functions to load required packages

        if(!require("pacman", character.only = TRUE)){
          install.packages("pacman",dependencies = T)
          }

# ------------------------------------------------------------------------------------------
# General packages
# ------------------------------------------------------------------------------------------
required.packages <- c("tidyverse", "here", "see", "ggplot2", "hablar", "kableExtra", "RColorBrewer", 
                       "cowplot", "ggpubr", "ggbeeswarm", "viridis",
# ------------------------------------------------------------------------------------------
# Parallel computing packages
# ------------------------------------------------------------------------------------------
                      "parallelMap", "parallelly", "parallel",
# ------------------------------------------------------------------------------------------
# Packages specifically for tidymodels machine learning                                                                   
# ------------------------------------------------------------------------------------------
                       "tidymodels", "treesnip","stacks", "workflowsets",            
                       "finetune", "vip", "modeldata", "finetune",  
                       "visdat", "spatialsample", "kknn",
                       "kernlab", "ranger", "xgboost", "tidyposterior"
                       )

p_load(char=required.packages, install = T,character.only = T)

Loading the ERAg package

The ERAg package includes all the ERA data we need for our ED steps.

Show code
# devtools::install_github("peetmate/ERAg", 
#                          auth_token = "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", 
#                          build_vignettes = T)

library(ERAg)

Preliminary data cleaning and data wrangling

The real power of ERA is that it is geo-referenced by latitude and longitude of where the agricultural research/study was performed. This make it exceptionally relevant for getting insights into useful relationships between any spatially explicit data and ERA practices, outcomes and (crop) products. Any data that can be spatially referenced such as weather, soil, marked access etc. can be investigated in relation to ERA data. In our case we are going to make use of two datasets, a digital soil for the continent of Africa called iSDA and a climate dataset, called BioClim adapted from WorldClim.org to have more biologically meaningful variables. The digital soil map on 30 m resolution is from the Innovative Solutions for Decision Agriculture (iSDA) data from Hengl, T., Miller, M.A.E., Križan, J. et al. (2021). Please see the websites isda-africa.com, worldagroforestry.org/isda and zenodo.org/isdasoil for more information and direct data access to this soil data. Please see the paper of Noce, S., Caporaso, L. & Santini, M. (2020) for more information on WorldClim and BioClim variables.

Merging ERA and biophysical pretictor features

Now that we are familiar with the biophysical co-variables, we are going to merging our ERA agroforestry data with the iSDA and the BioClim data. We can use the coordinate information we have from our latitude and longitude columns in our ERA agroforestry data to merge by.

Note: here that we are proceeding with the agroforestry dataset called “ERA.AF.PR.TRESH” that we created in the previous section - because this dataset is the subset of ERA.Compiled, in which we have only agroforestry data, we have filtered out all non-annual crops and included a thresholds with a minimum of 25 observations per agroforestry practice. The ERA.AF.PR.TRESH data is not the output of the ERAg::PrepareERA function, thus we have not removed data with negative outcomes, that would potentially cause issues in the machine learning analysis.

# Defining new variable of agroforestry data

# To avoid common problems of data tables we are going to use 
# the data.table::copy() function to add our ERA.AF.PR.TRESH data and define a new variable
era.agrofor <- data.table::copy(ERA.AF.PR.TRESH)

dim(era.agrofor)
#[1] 6361  217

BioClim climate co-variables

Show code
# Getting the BioClim co-variables from the ERAg package and converting the dataset to a dataframe
ERA.BioClim <- as.data.table(ERAg::ERA_BioClim)
# Changing the name of the column that contains the coordinates from "V1" to "Site.Key", so that we can use it to merge later
names(ERA.BioClim)[names(ERA.BioClim) == "V1"] <- "Site.Key" 

# ------------------------------------------------------------------------------------------------------------------------
# merging with the agroforestry.bioclim data using the common "Site.Key" coordinates and creating a new dataset

era.agrofor.bioclim <- cbind(era.agrofor, ERA.BioClim[match(era.agrofor[, Site.Key], ERA.BioClim[, Site.Key])])
dim(era.agrofor.bioclim)
    # [1] 6361  332

# rmarkdown::paged_table(era.agrofor.bioclim)

ERA Physical co-variables

Show code
# Getting some of the older the physical (ERA_Physical) co-variables from the ERAg package. 
# Note that we load in this ERA data because some of this information is not found in the iSDA data (e.g. altitude and slope)

#converting the dataset to a dataframe
ERA.Phys <- as.data.table(ERAg::ERA_Physical)

# selecting features
ERA.Phys <- ERA.Phys %>%
  dplyr::select(Site.Key, Altitude.mean, Altitude.sd, Altitude.max, Altitude.min, Slope.med, Slope.mean, Slope.sd, Slope.max,
                Slope.min, Aspect.med, Aspect.mean, Aspect.sd)

# ------------------------------------------------------------------------------------------------------------------------
# merging with the era.agrofor.bioclim data using the common "Site.Key" coordinates and creating a new dataset

era.agrofor.bioclim.phys <- cbind(era.agrofor.bioclim, ERA.Phys[match(era.agrofor.bioclim[, Site.Key], ERA.Phys[, Site.Key])])
dim(era.agrofor.bioclim.phys)
   # [1] 6361  345

iSDA soil co-variables In a similar fasion we can load the iSDA soil co-variables from the ERAg package.

Show code
# iSDA
iSDAsoils <- ERAg::ERA_ISDA

# We want to retain the soil depth information from iSDA, so we create fields for upper and lower measurement depths.
iSDAsoils[grepl("0..200cm",Variable),Depth.Upper:=0
      ][grepl("0..200cm",Variable),Depth.Lower:=200
        ][,Variable:=gsub("_m_30m_0..200cm","",Variable)]

iSDAsoils[grepl("0..20cm",Variable),Depth.Upper:=0
      ][grepl("0..20cm",Variable),Depth.Lower:=20
        ][,Variable:=gsub("_m_30m_0..20cm","",Variable)]

iSDAsoils[grepl("20..50cm",Variable),Depth.Upper:=20
      ][grepl("20..50cm",Variable),Depth.Lower:=50
        ][,Variable:=gsub("_m_30m_20..50cm","",Variable)]

iSDAsoils[,Weight:=Depth.Lower-Depth.Upper]

iSDAsoils <- unique(iSDAsoils[,list(Median=weighted.mean(Median,Weight),Mean=weighted.mean(Mean,Weight),SD=weighted.mean(SD,Weight),Mode=Mode[Weight==max(Weight)]),by=list(Variable,Site.Key)])

iSDAsoils <- dcast(iSDAsoils,Site.Key~Variable,value.var = c("Median","Mean","SD","Mode"))

ERA.iSDA <- as.data.table(iSDAsoils)

# ------------------------------------------------------------------------------------------------------------------------
# merging with the era.agrofor.bioclim.phys data using the common "Site.Key" coordinates and creating a new dataset

era.agrofor.bioclim.phys.isda <- cbind(era.agrofor.bioclim.phys, ERA.iSDA[match(era.agrofor.bioclim.phys[, Site.Key], ERA.iSDA[, Site.Key])])
dim(era.agrofor.bioclim.phys.isda)
   # [1] 6361  425
   # ([1] 6361  418)

We see that the number of features have increased dramatically. We now have 418 co-variables, also called independent variables in modelling, and features in typical machine learning language. From now on the term “feature” will be used to describe the columns including the co-variables.

Show code
era.agrofor.bioclim.phys.isda.na <- is.na(era.agrofor.bioclim.phys.isda) %>% 
  colSums() 

era.agrofor.bioclim.phys.isda.na

Removing duplicated columns

Show code
era.agrofor.bioclim.phys.isda.no.dubl <- era.agrofor.bioclim.phys.isda %>%    
  subset(., select = which(!duplicated(names(.))))

Data Cleaning

Deriving value from any machine learning-based approach crucially depends on the quality of the underlying data. There is a lot of data wrangling and pre-processing to do before this messy data is useful and ready for machine learning analysis. We are first going to perform some preliminary data cleaning as our first pre-processing step. Even though many of the important data cleaning steps are integrated into the tidymodels framework when we are going to construct our machine learning model workflows it is still important to perform the crucial data cleaning step manually prior to advancing into the more complex machine learning in order to get a better understanding of the nature of the data.

We will iterate between the data cleaning and the exploitative data analysis of the agroforestry data.

Renaming features from BioClim and iSDA Note that all variables are mean values. Keep individual units in mind!

# New ERA dataframe with all biophysical co-variables
agrofor.biophys <- data.table::copy(era.agrofor.bioclim.phys.isda.no.dubl)


dim(era.agrofor.bioclim.phys.isda.no.dubl)
  # [1] 6361  415
class(agrofor.biophys)
  # [1] "data.table" "data.frame"
dim(agrofor.biophys)
  # [1] 6361  415


agrofor.biophys.rename <- data.table::copy(agrofor.biophys) %>% 
  
# Renaming BioClim features
  rename(Bio01_MT_Annu     = wc2.1_30s_bio_1.Mean) %>%
  rename(Bio02_MDR         = wc2.1_30s_bio_2.Mean) %>%
  rename(Bio03_Iso         = wc2.1_30s_bio_3.Mean) %>%
  rename(Bio04_TS          = wc2.1_30s_bio_4.Mean) %>%
  rename(Bio05_TWM         = wc2.1_30s_bio_5.Mean) %>%
  rename(Bio06_MinTCM      = wc2.1_30s_bio_6.Mean) %>%
  rename(Bio07_TAR         = wc2.1_30s_bio_7.Mean) %>%
  rename(Bio08_MT_WetQ     = wc2.1_30s_bio_8.Mean) %>%
  rename(Bio09_MT_DryQ     = wc2.1_30s_bio_9.Mean) %>%
  rename(Bio10_MT_WarQ     = wc2.1_30s_bio_10.Mean) %>%
  rename(Bio11_MT_ColQ     = wc2.1_30s_bio_11.Mean) %>%
  rename(Bio12_Pecip_Annu  = wc2.1_30s_bio_12.Mean) %>%
  rename(Bio13_Precip_WetM = wc2.1_30s_bio_13.Mean) %>%
  rename(Bio14_Precip_DryM = wc2.1_30s_bio_14.Mean) %>%
  rename(Bio15_Precip_S    = wc2.1_30s_bio_15.Mean) %>%
  rename(Bio16_Precip_WetQ = wc2.1_30s_bio_16.Mean) %>%
  rename(Bio17_Precip_DryQ = wc2.1_30s_bio_17.Mean) %>%
  rename(Bio18_Precip_WarQ = wc2.1_30s_bio_18.Mean) %>%
  rename(Bio19_Precip_ColQ = wc2.1_30s_bio_19.Mean) %>%
  
# Renaming iSDA soil variables
  rename(iSDA_Depth_to_bedrock     = Mean_bdr) %>%
  rename(iSDA_SAND_conc            = Mean_sand_tot_psa) %>%
  rename(iSDA_SILT_conc            = Mean_silt_tot_psa) %>%
  rename(iSDA_CLAY_conc            = Mean_clay_tot_psa) %>%
  rename(iSDA_FE_Bulk_dens         = Mean_db_od) %>%
  rename(iSDA_log_C_tot            = Mean_log.c_tot) %>%
  rename(iSDA_log_Ca               = Mean_log.ca_mehlich3) %>%
  rename(iSDA_log_eCEC             = Mean_log.ecec.f) %>%
  rename(iSDA_log_Fe               = Mean_log.fe_mehlich3) %>%
  rename(iSDA_log_K                = Mean_log.k_mehlich3) %>%
  rename(iSDA_log_Mg               = Mean_log.mg_mehlich3) %>%
  rename(iSDA_log_N                = Mean_log.n_tot_ncs) %>%
  rename(iSDA_log_SOC              = Mean_log.oc) %>%
  rename(iSDA_log_P                = Mean_log.p_mehlich3) %>%
  rename(iSDA_log_S                = Mean_log.s_mehlich3) %>%
  rename(iSDA_pH                   = Mean_ph_h2o) %>%

# Renaming iSDA soil texture values based on the guide according to iSDA
  rowwise() %>%
  mutate(Texture_class_20cm_descrip =
          case_when(
            as.numeric(Mode_texture.class) == 1 ~ "Clay", 
            as.numeric(Mode_texture.class) == 2 ~ "Silty_clay",
            as.numeric(Mode_texture.class) == 3 ~ "Sandy_clay",
            as.numeric(Mode_texture.class) == 4 ~ "Loam",
            as.numeric(Mode_texture.class) == 5 ~ "Silty_clay_loam",
            as.numeric(Mode_texture.class) == 6 ~ "Sandy_clay_loam",
            as.numeric(Mode_texture.class) == 7 ~ "Loam",
            as.numeric(Mode_texture.class) == 8 ~ "Silt_loam",
            as.numeric(Mode_texture.class) == 9 ~ "Sandy_loam",
            as.numeric(Mode_texture.class) == 10 ~ "Silt",
            as.numeric(Mode_texture.class) == 11 ~ "Loam_sand",
            as.numeric(Mode_texture.class) == 12 ~ "Sand",
            as.numeric(Mode_texture.class) == 255 ~ "NODATA")) %>%
  
# Renaming the ASTER variables, slope and elevation  
  rename(ASTER_Slope               = Slope.mean) %>%
  rename(ASTER_Altitude            = Altitude.mean) %>%

# Renaming agroecological zone 16 simple
  mutate(AEZ16s = AEZ16simple) 


# Error: C stack usage  7969312 is too close to the limit

Fixing empty cells in the Tree column

Show code
agrofor.biophys.rename <- agrofor.biophys.rename %>% 
  
  #dplyr::filter_all(all_vars(!is.infinite(.))) %>%
  dplyr::mutate(across(Tree, ~ ifelse(. == "", NA, as.character(.)))) %>%
  mutate_if(is.character, factor) %>%
  dplyr::mutate(across(Tree, ~ ifelse(. == "NA", "Unknown_tree", as.character(.)))) %>% 
  dplyr::mutate(across(AEZ16s, ~ ifelse(. == "NA", "Unknown_AEZ16", as.character(.)))) %>%
  replace_na(list(Tree = "Unknown_Tree", AEZ16s = "Unknown_AEZ16s")) 

# Error: C stack usage  7969312 is too close to the limit

There are currently several site types we will change this so we only have three site types. A Site.Type called “Farm,” another called “Station” and a third one called “Other,” that will collect site types that are neither Farm nor Station. We will also remove the site type “Greenhouse,” as it does not make sense in an agroforestry setting. It remains unknown where the ERA agroforestry observations with “Greenhouse” as site type come from..

Renaming the column Site.Types

Show code
agrofor.biophys.rename <- agrofor.biophys.rename %>%
  filter(!(Site.Type == "Greenhouse")) %>%
  dplyr::mutate(
    Site_Type = case_when(
      str_detect(Site.Type, "Farm") ~ "Farm",
      str_detect(Site.Type, "Station") ~ "Station",
      TRUE ~ "Other")
    ) %>%
  mutate_if(is.character, factor)

Selecting ERA Agroforestry outcomes; Crop and Biomass Yield

For this machine learning analysis we will only focus on agroforestry data that have crop and biomass yield as outcomes. This is because it is not meaningful to compare response ratios across the diverse set of ERA outcomes as the basis for calculating MeanC and MeanT might vary tremendously. Hence due to that fact we are subsetting on Out.SubInd for crop and biomass yield.

agrofor.biophys.outcome.yield <- agrofor.biophys.rename %>%
  dplyr::filter(Out.SubInd == "Crop Yield" | Out.SubInd == "Biomass Yield")

Selecting features for modelling and calculating outcome variable, response ratios

we are going to select the features for our modelling and machine learning exercise. We are selecting MeanC and MeanT as we will convert these to one outcome feature.

Selecting feature columns for modelling

agrofor.biophys.selected.features <- data.table::copy(agrofor.biophys.outcome.yield) %>%
  
  dplyr::select(MeanC,             # Outcome variable(s)
                MeanT,
                
                ID,                # ID variable (will not be included as predictor)
                AEZ16s,            # ID variable (will not be included as predictor)
                PrName,            # ID variable (will not be included as predictor)
                PrName.Code,       # ID variable (will not be included as predictor)
                SubPrName,         # ID variable (will not be included as predictor)
                Product,           # ID variable (will not be included as predictor)
                Out.SubInd,        # ID variable (will not be included as predictor)
                Out.SubInd.Code,   # ID variable (will not be included as predictor)
                Country,           # ID variable (will not be included as predictor)
                
                Latitude,          # ID variable (will not be included as predictor, but as a basis for the spatial/cluster cross-validation)
                Longitude,         # ID variable (will not be included as predictor, but as a basis for the spatial/cluster cross-validation)
                
                Site.Type,         # Site_Type will be included as predictor, to see how much it explains the outcome
                Tree,              # Tree will be included as predictor, to see how much it explains the outcome
                
                Bio01_MT_Annu,
                Bio02_MDR,
                Bio03_Iso, 
                Bio04_TS,
                Bio05_TWM,
                Bio06_MinTCM,
                Bio07_TAR, 
                Bio08_MT_WetQ,
                Bio09_MT_DryQ,
                Bio10_MT_WarQ,
                Bio11_MT_ColQ,
                Bio12_Pecip_Annu,
                Bio13_Precip_WetM,
                Bio14_Precip_DryM,
                Bio15_Precip_S,
                Bio16_Precip_WetQ,
                Bio17_Precip_DryQ,
                
                iSDA_Depth_to_bedrock,
                iSDA_SAND_conc,
                iSDA_CLAY_conc,
                iSDA_SILT_conc,
                iSDA_FE_Bulk_dens,
                iSDA_log_C_tot,
                iSDA_log_Ca,
                iSDA_log_eCEC,
                iSDA_log_Fe,
                iSDA_log_K,
                iSDA_log_Mg,
                iSDA_log_N,
                iSDA_log_SOC,
                iSDA_log_P,
                iSDA_log_S,
                iSDA_pH,
            
                ASTER_Altitude,                          
                ASTER_Slope
              ) %>%
  relocate(ID, PrName, Out.SubInd, Product, AEZ16s, Country, Site.Type, MeanC, MeanT, Tree)



dim(agrofor.biophys.selected.features)
  # [1] 4578   48
class(agrofor.biophys.selected.features)
  # [1] "rowwise_df" "tbl_df"     "tbl"        "data.frame"

Calculating response ratios for each observation We are going to compute a simple response ratio as RR = (MeanT/MeanC) and a log-transformed response ratio as logRR = (log(MeanT/MeanC)). The later will be used as our outcome variable.

agrofor.biophys.modelling.data <- data.table::copy(agrofor.biophys.selected.features) %>% 
  #subset(., select = which(!duplicated(names(.)))) %>%                   # First, removing duplicated columns
  dplyr::mutate(RR = (MeanT/MeanC)) %>%
  dplyr::mutate(logRR = (log(MeanT/MeanC)))

dim(agrofor.biophys.modelling.data)
   # [1] 4578   52

Lets have a look at our modelling data with selected features and our outcome, calculated response ratios

Show code
agrofor.biophys.modelling.data <- readRDS(file = here::here("TidyMod_OUTPUT","agrofor.biophys.modelling.data.RDS"))

rmarkdown::paged_table(agrofor.biophys.modelling.data)

Looks pretty good, however we have many empty cells for the Tree column that needs to be fixed. We have a data.frame (rowwise_df, tbl_df, tbl, data.frame) with 4578 observations and 52 features. Four of these features are our MeanC and MeanT that we will use to calculate the response ratio (RR and logRR). The RR and/or logRR will be our outcome feature, also called response variable. Hence, we have 52 - (4 + 11) = 37 predictors of which 35 are numeric and two (Tree and Site.Type are categorical/factors).

Creating data with only numeric predictors

A separate dataset with only numeric predictors and our outcome (RR and logRR) can be helpful when performing some important exploitative data analysis.

agrofor.biophys.modelling.data.numeric <- data.table::copy(agrofor.biophys.modelling.data) %>%
  dplyr::select(RR,
                logRR,
                
                Bio01_MT_Annu,
                Bio02_MDR,
                Bio03_Iso, 
                Bio04_TS,
                Bio05_TWM,
                Bio06_MinTCM,
                Bio07_TAR, 
                Bio08_MT_WetQ,
                Bio09_MT_DryQ,
                Bio10_MT_WarQ,
                Bio11_MT_ColQ,
                Bio12_Pecip_Annu,
                Bio13_Precip_WetM,
                Bio14_Precip_DryM,
                Bio15_Precip_S,
                Bio16_Precip_WetQ,
                Bio17_Precip_DryQ,
                
                iSDA_Depth_to_bedrock,
                iSDA_SAND_conc,
                iSDA_CLAY_conc,
                iSDA_SILT_conc,
                iSDA_FE_Bulk_dens,
                iSDA_log_C_tot,
                iSDA_log_Ca,
                iSDA_log_eCEC,
                iSDA_log_Fe,
                iSDA_log_K,
                iSDA_log_Mg,
                iSDA_log_N,
                iSDA_log_SOC,
                iSDA_log_P,
                iSDA_log_S,
                iSDA_pH,
            
                ASTER_Altitude,                          
                ASTER_Slope
                ) 

Explorotive Data Analysis

We are going to perform some exploratory data analysis (EDA) of the ERA agroforestry data. EDA is a critical pre-modelling step in order to make initial investigations so as to discover patterns, to spot anomalies, test hypothesis and to check assumptions. We are going to do this on the agroforestry data with the help of visualisations and summary statistics. Performing exploratory data analysis prior to conducting any machine learning is always a good idea, as it will help to get an understanding of the data in terms of relationships and how different predictor features are distributed, how they relate to the outcome (logRR) and how they relate to each other. Explorative data analysis is often an iterative process consisting of posing a question to the data, review the data, and develop further questions to investigate it before beginning the modelling and machine learning development. Hence, we need to conduct a proper EDA of our agroforestry data as it is fundamental to our later machine learning-based modelling analysis. We will first visually explore our agroforestry data using the DataExplorer package to generate histograms, density plots and scatter plots of the biophysical predictors. We will examine the results critically with our later machine learning approach in mind. We will continue the EDA by looking at the correlations between our biophysical predictors - and between our predictors and the outcome (logRR) to see if we can make any initial pattern recognition manually. Overall we will use the outcome of our EDA to identify required pre-processing steps needed for our machine learning models (e.g. excluding highly correlated predictors or normalizing some features for better model performance).

We will go through these nine steps in the EDA:

Lets first familiarise ourself with the agroforestry data using the functions introduce() and plot_intro() from DataExplorer package.

Show code
DataExplorer::introduce(agrofor.biophys.modelling.data) %>% glimpse()
Rows: 1
Columns: 9
Rowwise: 
$ rows                 <int> 4578
$ columns              <int> 52
$ discrete_columns     <int> 10
$ continuous_columns   <int> 42
$ all_missing_columns  <int> 0
$ total_missing_values <int> 2436
$ complete_rows        <int> 4473
$ total_observations   <int> 238056
$ memory_usage         <dbl> 1783016
Show code
DataExplorer::plot_intro(agrofor.biophys.modelling.data)
Re-introducing the agroforestry data

(#fig:Re-introduction to the agroforestry data plot)Re-introducing the agroforestry data

Missing data exploration

As part of the EDA of data cleaning we would need to deal with missing data in our agroforestry data. To get a better understanding of our missing data, such as “where it is” and how it manifest it self. We are going to make use of the naniar package, that is excellent for familiarixing our selfs with the missingness of the data.

First, lets view the number and percentage of missing values in each combination of grouped levels of practices (PrName) and outcomes (Out.SubInd).

Show code
agrofor.biophys.modelling.data.missing <- agrofor.biophys.modelling.data %>%
  dplyr::group_by(PrName, Out.SubInd, AEZ16s) %>%
  naniar::miss_var_summary() %>%
  arrange(desc(n_miss))
Show code
rmarkdown::paged_table(agrofor.biophys.modelling.data.missing)

Now we can visually assess where there is missing data. Lets also make one where we facet based on agroecological zones as it is interesting to see in which agroecological zones we miss data.

agrofor.biophys.gg_miss_var.all <- naniar::gg_miss_var(agrofor.biophys.modelling.data,
                    show_pct = TRUE,
                    facet = "none") 

agrofor.biophys.gg_miss_var.AEZ <- naniar::gg_miss_var(agrofor.biophys.modelling.data,
                    show_pct = TRUE,
                    facet = AEZ16s) 

# Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.
Show code
agrofor.biophys.gg_miss_var.combined <- cowplot::plot_grid(agrofor.biophys.gg_miss_var.all,
                                                           agrofor.biophys.gg_miss_var.AEZ)

agrofor.biophys.gg_miss_var.combined
Visually assess missing data

(#fig:Visually assess where we have missing data)Visually assess missing data

Show code
# Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.

We see that most of data is quite complete however there are some spatially explicit patters for iSDA and WorldClim (BioClim) data. The variable/feature “Tree (species)” is the feature where most data is missing, yet it is only around 20 % of this data we do not have in our agroforestry data. In the faceted plot (by: agroecological zone), we see that there is a great deal of overlap between where we have missing quantitative data, from either iSDA or WorldClim and the missing qualitative information on agroecological zone - again highlighting that some areas/ERA observations are from areas outside the boundary of the co-variables data have been sourced.

Lets look accros the site types (farm, research station and survey) whether we see any differences in missingness of data. To do this we are using the geom_miss_point() function from the naniar package. We are going to plot a random iSDA feature agains a random BioClim feature - as we now know that if there are missing data from one iSDA (or BioClim) feature it is missing in all the features deriving from the respected data source.

agrofor.biophys.selected.features <- readRDS(file = here::here("TidyMod_OUTPUT","agrofor.biophys.selected.features.RDS"))

miss.check.site <-
ggplot(data = agrofor.biophys.selected.features,
       aes(x = Bio09_MT_DryQ,
           y = iSDA_CLAY_conc)) +
  naniar::geom_miss_point() +
  facet_wrap(~Site.Type, ncol = 2, scales = "free_x") + 
  coord_cartesian(ylim = c(-1,60)) +
  theme(legend.position = "bottom")

miss.check.country <-
ggplot(data = agrofor.biophys.selected.features,
       aes(x = Bio09_MT_DryQ,
           y = iSDA_CLAY_conc)) +
  naniar::geom_miss_point() +
  facet_wrap(~Country, ncol = 2, scales = "free_x") + 
  coord_cartesian(ylim = c(-1,60)) +
  theme(legend.position = "bottom") 
Show code
miss.check.site
Missing data accross site type

(#fig:Plotting differences in missing data distribution accross site types)Missing data accross site type

..and what about Countries? Do we see differences in the distribution of missing data across Countries?

Show code
miss.check.country
Missing data accross Countries

(#fig:Plotting differences in missing data distribution accross contries)Missing data accross Countries

We see that missing data is present only from research sites on farms. We can use the bind_shadow() and nabular() functions from naniar to keep better track of the missing values in our agroforestry data. We also see that Kenya and Zimbabwe are the two major \(sources\) of missing data. This is strange as they are not located outside the countinent.

If we use the nabular table format it helps you manage where missing values are in the dataset and makes it easy to do visualisations where when we split by missingness. Lets look at the missing data overlap between the two co-variables “Bio09_MT_DryQ” and “iSDA_CLAY_conc,” and then visualise a density plot.

Show code
agrofor.biophys.modelling.data.nabular <- naniar::nabular(agrofor.biophys.selected.features)
rmarkdown::paged_table(agrofor.biophys.modelling.data.nabular)

Using the naniar function called bind_shadow() we can generate a visual showing the areas of our agroforestry data where we have missing data. Lets make such a shaddow plot:

Show code
agrofor.biophys.modelling.data.numeric <- readRDS(file = here::here("TidyMod_OUTPUT", "agrofor.biophys.modelling.data.numeric.RDS"))

agrofor.biophys.modelling.data.numeric.shadow <- naniar::bind_shadow(agrofor.biophys.modelling.data.numeric)

# Bio01_MT_Annu
agrofor.biophys.shadow.bind.plot.Bio01 <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = Bio01_MT_Annu_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "BIO.01 - Mean annual temperature")

# Bio02_MDR
agrofor.biophys.shadow.bind.plot.Bio02 <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = Bio02_MDR_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "BIO.02 - Mean diurnal range")


# Bio03_Iso
agrofor.biophys.shadow.bind.plot.Bio03 <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = Bio03_Iso_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "BIO.03 - Isothermality")

# Bio07_TAR
agrofor.biophys.shadow.bind.plot.Bio07 <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = Bio07_TAR_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "BIO.07 - Temperature annual range")

# Bio12_Pecip_Annu
agrofor.biophys.shadow.bind.plot.Bio12 <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = Bio12_Pecip_Annu_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "BIO.12 - Annual precipitation")
           
# iSDA_SAND_conc
agrofor.biophys.shadow.bind.plot.sand <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = iSDA_SAND_conc_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "iSDA - Sand concentration")

# iSDA_log_C_tot
agrofor.biophys.shadow.bind.plot.totcarbon <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = iSDA_log_C_tot_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "iSDA - total carbon")

# iSDA_pH
agrofor.biophys.shadow.bind.plot.pH <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = iSDA_pH_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "iSDA - pH")

# iSDA_log_N
agrofor.biophys.shadow.bind.plot.N <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = iSDA_log_N_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "iSDA - Nitrogen")

# ASTER_Slope
agrofor.biophys.shadow.bind.plot.slope <- agrofor.biophys.modelling.data.numeric.shadow %>%
  ggplot(aes(x = logRR, 
             fill = ASTER_Slope_NA)) +
  geom_density(col = "gray", alpha = 0.5, size = 1) +
  scale_fill_manual(aesthetics = "fill", values = c("#A2AC8A", "#FED82F")) +
  theme_lucid(legend.title.size = 0) +
  labs(subtitle = "ASTER - Slope")

# Extract a legend theme from the yplot that is laid out horizontally
agrofor.biophys.shadow.legend <- get_legend(agrofor.biophys.shadow.bind.plot.Bio01 +
    guides(color = guide_legend(nrow = 1)) +
    theme_lucid(legend.position = "bottom", legend.title.size = 0))

# Cleaning the yplot for any style or theme 

agrofor.biophys.shadow.bind.plot.Bio01 <- agrofor.biophys.shadow.bind.plot.Bio01 + clean_theme() + rremove("legend") 
agrofor.biophys.shadow.bind.plot.Bio02 <- agrofor.biophys.shadow.bind.plot.Bio02 + clean_theme() + rremove("legend") 
agrofor.biophys.shadow.bind.plot.Bio03 <- agrofor.biophys.shadow.bind.plot.Bio03 + clean_theme() + rremove("legend") 
agrofor.biophys.shadow.bind.plot.Bio07 <- agrofor.biophys.shadow.bind.plot.Bio07 + clean_theme() + rremove("legend") 
agrofor.biophys.shadow.bind.plot.Bio12 <- agrofor.biophys.shadow.bind.plot.Bio12 + clean_theme() + rremove("legend") 

agrofor.biophys.shadow.bind.plot.sand <- agrofor.biophys.shadow.bind.plot.sand + clean_theme() + rremove("legend") 
agrofor.biophys.shadow.bind.plot.pH <- agrofor.biophys.shadow.bind.plot.pH + clean_theme() + rremove("legend") 
agrofor.biophys.shadow.bind.plot.N <- agrofor.biophys.shadow.bind.plot.N + clean_theme() + rremove("legend") 
agrofor.biophys.shadow.bind.plot.totcarbon <- agrofor.biophys.shadow.bind.plot.totcarbon + clean_theme() + rremove("legend") 

agrofor.biophys.shadow.bind.plot.slope <- agrofor.biophys.shadow.bind.plot.slope + clean_theme() + rremove("legend") 

# Creating the title
    # add margin on the left of the drawing canvas,
    # so title is aligned with left edge of first plot
agrofor.biophys.shadow.title <- ggdraw() + 
  draw_label(
    "Combined shaddow density plot of missing value distribution of selected biophysical predictors vs. logRR",
    fontface = 'bold',
    x = 0,
    hjust = 0,
    size = 20) + 
  theme(plot.margin = margin(0, 0, 0, 7))


# Warning: Removed 15 rows containing non-finite values (stat_density).

Lets combine these in a cowplot

Show code
# Arranging the plots and using cowplot::plot_grid() to plot the three plots together
combined.agrofor.biophys.shadow <- cowplot::plot_grid(
  agrofor.biophys.shadow.title,
  NULL,
  NULL,
#########
  agrofor.biophys.shadow.bind.plot.Bio01,
  agrofor.biophys.shadow.bind.plot.Bio02,
  agrofor.biophys.shadow.bind.plot.Bio03,
  agrofor.biophys.shadow.bind.plot.Bio07,
  agrofor.biophys.shadow.bind.plot.Bio12,
  agrofor.biophys.shadow.bind.plot.sand,
  agrofor.biophys.shadow.bind.plot.pH,
  agrofor.biophys.shadow.bind.plot.N,
  agrofor.biophys.shadow.bind.plot.totcarbon,
  agrofor.biophys.shadow.bind.plot.slope,
#########
  agrofor.biophys.shadow.legend,
#########
  ncol = 3, nrow = 5, align = "hv",
  rel_widths = c(1, 1, 1), 
  rel_heights = c(2, 2, 2, 2))

combined.agrofor.biophys.shadow
Shaddow plot of selected biophysical predictor features

Figure 1: Shaddow plot of selected biophysical predictor features

Show code
# Warning: Removed 15 rows containing non-finite values (stat_density).
# Warning: Graphs cannot be vertically aligned unless the axis parameter is set. Placing graphs unaligned.
# Warning: Graphs cannot be horizontally aligned unless the axis parameter is set. Placing graphs unaligned.

We see again that there is a spatial relation between the missingness of our data, as the pattern is similar for the features extracted from the same data sources. Therefore there must be some areas where we have no biophysical information. This is perhaps because the biophysical co-variables are extracted from a fixed boundary of the African continent and islands such as Capo Verde is situated outside the continent.

Density distributions of features

Until this point we have identified and familiarised ourself with the missing data. We can now make a clean numeric dataset in which we remove all NA observations. This will be helpful for the further process of EDA. For a more aesthetic experience we are also making a custom colour range of 35 colours that we can use to plot our 35 predictors with.

expl.agfor.data.no.na <- agrofor.biophys.modelling.data.numeric %>%
  rationalize() %>%
  drop_na()

dim(agrofor.biophys.modelling.data.numeric) # <- before removing NAs
[1] 4578   37
  # [1] 4578   37
dim(expl.agfor.data.no.na)                  # <- after removing NAs
[1] 4461   37
  # [1] 4461   37

# Colour range with 35 levels
nb.cols.35 <- 35
era.af.colors.35 <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols.35)
Show code
expl.agfor.dens.pred <- readRDS(file = here::here("TidyMod_OUTPUT", "expl.agfor.dens.pred.RDS"))

expl.agfor.dens.pred
$page_1
Density distributions of biophysical predictors

Figure 2: Density distributions of biophysical predictors

Not surprising we see a lot of variation in the density distribution for our biophysical predictors. Some of them shows a multimodal distribution patterns, e.g. altitude and Bio06 the minimum temperature of coldest month, Bio03 the Isothermality and iSDA pH shows a bimodal density distribution pattern, while others, such as Bio05 maximum temperature of warmest month, iSDA Iron concentrations and iSDA total carbon are skewed to the left, right and right, respectfully. Others, like Bio09 mean temperature of driest quarter and iSDA phosphorus concentrations seems to be having a good similarity of a normal distribution. This and other observations gives us a good understanding about the predictors. One take-away massage would be that we clearly need some sort of normalisation and centralisation of the predictors prior to our machine learning exercise in order to get good performance.

Show code
expl.agfor.dens.outcome <- expl.agfor.data.no.na %>%
  dplyr::select(c("RR", "logRR")) %>% 
  plot_density(ggtheme = theme_lucid(),
                 geom_density_args = list("fill" = "#FED82F", "col" = "gray", "alpha" = 0.5, "size" = 2, "adjust" = 0.5),
                 ncol = 2, nrow = 1, 
                 parallel = FALSE,
                 title = "Density distributions of the outcome logRR or RR"
                 ) 
Show code
expl.agfor.dens.outcome <- readRDS(file = here::here("TidyMod_OUTPUT", "expl.agfor.dens.outcome.RDS"))

expl.agfor.dens.outcome
$page_1
Density distributions logRR

(#fig:logRR density distrubution)Density distributions logRR

It is not only the destribution of predictors that matter when doing machine learning modelling, and in order to get a good performance we also need to check for normality in our outcome feature as they will be difficult for the algorithm to predict if we have too abnormall patterns. So lets view the density distribution for our outcome feature(s) logRR and RR.

This clearly demonstrate that the log transformed RR is a much better outcome feature, compared to the raw RR. We might want to exclude extreme values from our logRR on both the positive and negative tail of the distribution. We might do this using the extreme outlier removal approach already used in the ERAAnalyze function, where response ratios above or below 3∗IQR (interquartile range) are removed.

Show code
rmarkdown::paged_table(expl.agfor.data.no.na.logRR.outlier %>% count(logRR.outlier, sort = TRUE))

Based on the extreme outliers removal method we have 40 outliers with 28 of them being < 0 and 12 being > 0. Lets assess these outliers visually by making a nice violin and box plot of the outcome variable logRR using ggplot2 with the ggbeeswarm package. We will include the outliers indicated in red within the 3 * QR limit.

Show code
ggplot_box_legend <- function(family = "serif"){

  # Create data to use in the boxplot legend:
  set.seed(100)

  sample_df <- data.frame(parameter = "test",
                        values = sample(500))

  # Extend the top whisker a bit:
  sample_df$values[1:100] <- 701:800
  # Make sure there's only 1 lower outlier:
  sample_df$values[1] <- -350

  # Function to calculate important values:
  ggplot2_boxplot <- function(x){

    quartiles <- as.numeric(quantile(x,
                                     probs = c(0.25, 0.5, 0.75)))

    names(quartiles) <- c("25th percentile",
                          "50th percentile\n(median)",
                          "75th percentile")

    IQR <- diff(quartiles[c(1,3)])

    upper_whisker <- max(x[x < (quartiles[3] + 1.5 * IQR)])
    lower_whisker <- min(x[x > (quartiles[1] - 1.5 * IQR)])

    upper_dots <- x[x > (quartiles[3] + 1.5*IQR)]
    lower_dots <- x[x < (quartiles[1] - 1.5*IQR)]

    return(list("quartiles" = quartiles,
                "25th percentile" = as.numeric(quartiles[1]),
                "50th percentile\n(median)" = as.numeric(quartiles[2]),
                "75th percentile" = as.numeric(quartiles[3]),
                "IQR" = IQR,
                "upper_whisker" = upper_whisker,
                "lower_whisker" = lower_whisker,
                "upper_dots" = upper_dots,
                "lower_dots" = lower_dots))
  }

  # Get those values:
  ggplot_output <- ggplot2_boxplot(sample_df$values)

  # Lots of text in the legend, make it smaller and consistent font:
  update_geom_defaults("text",
                     list(size = 3,
                          hjust = 0,
                          family = family))
  # Labels don't inherit text:
  update_geom_defaults("label",
                     list(size = 3,
                          hjust = 0,
                          family = family))

  # Create the legend:
  # The main elements of the plot (the boxplot, error bars, and count)
  # are the easy part.
  # The text describing each of those takes a lot of fiddling to
  # get the location and style just right:
  explain_plot <- ggplot() +
            geom_violin(data = sample_df,
                 aes(x = parameter, y = values),
                alpha = 0.5, 
                size = 4, 
                fill = "#FED82F", 
                colour = "white",
                width = 0.3) +
    geom_boxplot(data = sample_df,
                 aes(x = parameter, y = values),
                 fill = "white",
                 col = "black",
                 notch = TRUE,  
                 outlier.size = 0.3, 
                 lwd = 1, 
                 alpha = 0.5, 
                 outlier.colour = "red",
                 outlier.fill = "black", 
                 show.legend = F, 
                 outlier.shape = 10) +
    stat_boxplot(data = sample_df,
                 aes(x = parameter, y = values),
                 alpha = 0.3,
                 size = 1,
                 geom = "errorbar", 
                 colour = "black",
                 linetype = 1, 
                 width = 0.3) +  #whiskers+
    
    geom_text(aes(x = 1, y = 950, label = ""), hjust = 0.5) +
    geom_text(aes(x = 1.17, y = 950,
                  label = ""),
              fontface = "bold", vjust = 0.4) +
    theme_minimal(base_size = 5, base_family = family) +
    geom_segment(aes(x = 2.3, xend = 2.3,
                     y = ggplot_output[["25th percentile"]],
                     yend = ggplot_output[["75th percentile"]])) +
    geom_segment(aes(x = 1.2, xend = 2.3,
                     y = ggplot_output[["25th percentile"]],
                     yend = ggplot_output[["25th percentile"]])) +
    geom_segment(aes(x = 1.2, xend = 2.3,
                     y = ggplot_output[["75th percentile"]],
                     yend = ggplot_output[["75th percentile"]])) +
    geom_text(aes(x = 2.4, y = ggplot_output[["50th percentile\n(median)"]]),
              label = "Interquartile\nrange", fontface = "bold",
              vjust = 0.4) +
    geom_text(aes(x = c(1.17,1.17),
                  y = c(ggplot_output[["upper_whisker"]],
                        ggplot_output[["lower_whisker"]]),
                  label = c("Largest value within 3 times\ninterquartile range above\n75th percentile",
                            "Smallest value within 3 times\ninterquartile range below\n25th percentile")),
                  fontface = "bold", vjust = 0.9) +
    geom_text(aes(x = c(1.17),
                  y =  ggplot_output[["lower_dots"]],
                  label = "Outside value"),
              vjust = 0.5, fontface = "bold") +
    geom_text(aes(x = c(1.9),
                  y =  ggplot_output[["lower_dots"]],
                  label = "-Value is > and/or < 3 times"),
              vjust = 0.5) +
    geom_text(aes(x = 1.17,
                  y = ggplot_output[["lower_dots"]],
                  label = "the interquartile range\nbeyond either end of the box"),
              vjust = 1.5) +
    geom_label(aes(x = 1.17, y = ggplot_output[["quartiles"]],
                  label = names(ggplot_output[["quartiles"]])),
              vjust = c(0.4,0.85,0.4),
              fill = "white", label.size = 0) +
    ylab("") + xlab("") +
    theme(axis.text = element_blank(),
          axis.ticks = element_blank(),
          panel.grid = element_blank(),
          aspect.ratio = 4/3,
          plot.title = element_text(hjust = 0.5, size = 10)) +
    coord_cartesian(xlim = c(1.4,3.1), ylim = c(-600, 900)) +
    labs(title = "EXPLANATION")

  return(explain_plot)

}

Creating logRR outliers plot

Show code
legend_plot <- ggplot_box_legend()

 logRR.outlier.plot <- 
   ggplot(data = expl.agfor.data.no.na.logRR.outlier, aes(y = logRR, x = "ERA_Agroforestry")) + 
       scale_fill_viridis_d( option = "D") + 
       geom_violin(alpha = 0.5, position = position_dodge(width = 1), size = 1, fill = "#FED82F", colour = "white") +
       geom_boxplot(notch = TRUE,  outlier.size = 0.3, color = "black", lwd = 1, alpha = 0.5, outlier.colour = "red", 
                    outlier.fill = "black", show.legend = F, outlier.shape = 10) +
       scale_y_continuous(breaks =seq(-5, 5, 0.5), limit = c(-6, 6)) +
       stat_boxplot(geom ='errorbar', width = 0.15, coef = 3) +
       geom_text(aes(label = logRR.outlier), na.rm = TRUE, hjust = -0.3) + 
      # geom_point( shape = 21,size=2, position = position_jitterdodge(), color="black",alpha=1)+
       ggbeeswarm::geom_quasirandom(shape = 21, size = 1, dodge.width = 0.5, color = "black", alpha = 0.3, show.legend = F) +
       theme_minimal() +
       ylab(c("logRR"))  +
       xlab("All ERA Agroforestry Data")  +
       theme(#panel.border = element_rect(colour = "black", fill=NA, size=2),
             axis.line = element_line(colour = "black" , size = 1),
             axis.ticks = element_line(size = 1, color = "black"),
             axis.text = element_text(color = "black"),
             axis.ticks.length = unit(0.5, "cm"),
             legend.position = "none") +
       font("xylab", size = 15) +  
       font("xy", size = 15) + 
       font("xy.text", size = 15) +  
       font("legend.text", size = 15) +
       guides(fill = guide_legend(override.aes = list(alpha = 0.3, color = "black")))
 
logRR.outlier.plot.title <- ggdraw() + 
  draw_label(
    "ERA agroforestry logRR violin and boxplot with outliers",
    fontface = 'bold',
    x = 0,
    hjust = 0,
    size = 20) + 
  theme(plot.margin = margin(0, 0, 0, 7))
Show code
logRR.outlier.plot.explained <- plot_grid(
  logRR.outlier.plot.title,
          NULL,
          NULL,
          logRR.outlier.plot,
          legend_plot,
          nrow = 2, 
          ncol = 3,
          rel_widths = c(8, 6, 4),
          rel_heights = c(1, 4, 6))
Show code
logRR.outlier.plot.explained <- readRDS(file = here::here("TidyMod_OUTPUT", "logRR.outlier.plot.explained.RDS"))

logRR.outlier.plot.explained
logRR outliers plot

Figure 3: logRR outliers plot

We see that there are some logRR observation we would have to remove in order for our machine learning models to perform well!

Histograms of features

Lets now look at the histograms of our biophysical predictors. This is slightly similar to the density plots, but with the histograms we are able to see how the data is distributed in bins of 30 observations each. We are using the plot_histogram() from the DataExplorer package.

Show code
expl.agfor.hist <- expl.agfor.data.no.na %>%
  dplyr::select(-c("RR", "logRR")) %>% 
  plot_histogram(ggtheme = theme_lucid(),
                 geom_histogram_args = list(fill = (explor.cols.group.35_1050), bins = 30L),
                 ncol = 6, nrow = 6, 
                 parallel = FALSE,
                 title = "hist"
                 ) 
Show code
expl.agfor.hist <- readRDS(file = here::here("TidyMod_OUTPUT", "expl.agfor.hist.RDS"))

expl.agfor.hist
$page_1
Density distributions logRR

Figure 4: Density distributions logRR

Correlation of features

We can use the the lares package is an R package built to automate, improve, and speed everyday analysis and machine learning tasks, build specifically for visualising regression and classification model performance by easily and intuitively giving most important plots, all generated at once. This function creates a correlation full study and returns a rank of the highest correlation variables obtained in a cross-table. Depending on input plot, we get correlation and p-value results for every combination of features, arranged by descending absolute correlation value.

Show code
agrofor.biophys.modelling.data.lares.corr <- readRDS(file = here::here("TidyMod_OUTPUT","agrofor.biophys.modelling.data.RDS"))

agrofor.biophys.modelling.data.lares.corr <- 
  agrofor.biophys.modelling.data.lares.corr %>%
  dplyr::select(-c("RR", "ID", "AEZ16s", "Country", "MeanC", "MeanT", "PrName", "PrName.Code", "ID", "SubPrName", "Out.SubInd", "Product", "Product", "Country", "Site.Type", "Tree", "Out.SubInd.Code")) 
 
# Show only most relevant results filtered by pvalue
lares::corr_cross(agrofor.biophys.modelling.data.lares.corr,
                  rm.na = TRUE,
                  max_pvalue = 0.05, 
                  top = 20,
                  quiet = TRUE)
Ranked cross-correlation plot of numeric biophysical co-variables

Figure 5: Ranked cross-correlation plot of numeric biophysical co-variables

There is another kind of cross-correlation that returns all correlations in a single plot, not necessarily ranked. This will help us understand the skewness or randomness of some correlations found. It will also highlight the highest correlations for each of the variables used.

Show code
# Cross-Correlation max values per category
lares::corr_cross(agrofor.biophys.modelling.data.lares.corr, type = 2, top = NA)
Ranked cross-correlation plot with all correlations of numeric biophysical co-variables

Figure 6: Ranked cross-correlation plot with all correlations of numeric biophysical co-variables

Tidymodels workflow steps

Lets not continue with the actual machine learning using the tidymodels framework. The tidymodels machine learning framework will be used to generate a mixture of regression models in order to generate a final ensemble stacked model in order to improve performance and prediction power.

Now we are fully set up to follow the tidymodels machine learning framework workflow:

  1. Initial data cleaning and outlier removal if necessary

  2. Splitting data, defining resampling techniques and setting global model metrics

  3. Model recipes - defining key pre-processing steps on data

  4. Setting model specifications

  5. Defining model workflows

  6. Model (hyper)-parameter tuning

  7. Evaluating model performances

  8. Selecting best performing models - based on best (hyper)-parameters

  9. Finalizing model workflow with best performing model (hyper)-parameter settings

  10. Perform last model fitting using all data (training + testing)

  11. Model validation/evaluation and performance, e.g. variable importance

  12. Creating final Ensemble Stacked Model - with a selection of different high-performing models

Show code
agrofor.biophys.modelling.data <- readRDS(file = here::here("TidyMod_OUTPUT","agrofor.biophys.modelling.data.RDS"))

ml.data <-  agrofor.biophys.modelling.data %>%
  dplyr::select(-c("RR", "ID", "AEZ16s", "Country", "MeanC", "MeanT", "PrName.Code", "SubPrName"))

Removing outliers from logRR

Agroforestry data with no outliers

ml.data.no.outliers <-  ml.data.outliers %>%
  dplyr::filter(logRR.outlier == 9999)

dim(ml.data.outliers)
[1] 4563   46
  #  [1] 4563   46
dim(ml.data.no.outliers)
[1] 4519   46
  # [1] 4519   46

Identifying missing values

Show code
# Note that there is missing data that is not being handled
# Can handle with step_knnimpute, step_mode, etc. (will use step_mode())
colSums(is.na(ml.data.no.outliers))  %>% 
  kbl() %>%
  kable_paper() %>%
  scroll_box(width = "100%", height = "400px")
x
PrName 0
Out.SubInd 0
Product 0
Site.Type 0
Tree 0
Out.SubInd.Code 0
Latitude 0
Longitude 0
Bio01_MT_Annu 42
Bio02_MDR 42
Bio03_Iso 42
Bio04_TS 42
Bio05_TWM 42
Bio06_MinTCM 42
Bio07_TAR 42
Bio08_MT_WetQ 42
Bio09_MT_DryQ 42
Bio10_MT_WarQ 42
Bio11_MT_ColQ 42
Bio12_Pecip_Annu 42
Bio13_Precip_WetM 42
Bio14_Precip_DryM 42
Bio15_Precip_S 42
Bio16_Precip_WetQ 42
Bio17_Precip_DryQ 42
iSDA_Depth_to_bedrock 101
iSDA_SAND_conc 101
iSDA_CLAY_conc 101
iSDA_SILT_conc 101
iSDA_FE_Bulk_dens 101
iSDA_log_C_tot 101
iSDA_log_Ca 101
iSDA_log_eCEC 101
iSDA_log_Fe 101
iSDA_log_K 101
iSDA_log_Mg 101
iSDA_log_N 101
iSDA_log_SOC 101
iSDA_log_P 101
iSDA_log_S 101
iSDA_pH 101
ASTER_Altitude 42
ASTER_Slope 42
logRR 0
ERA_Agroforestry 0
logRR.outlier 0
Show code
ml.data.clean <- data.table::copy(ml.data.no.outliers)

Removing NAs from the data

Show code
ml.data.clean <- data.table::copy(ml.data.clean) %>%
  drop_na()

STEP 1: Splitting data, defining resampling techniques and setting global model metrics

Splitting data in training and testing sets

set.seed(234)

# Splitting data
af.split <- initial_split(ml.data.clean, prop = 0.80, strata = logRR)

af.train <- training(af.split)
af.test <- testing(af.split)

Defining resampling techniques

# Re-sample technique(s)
boostrap_df <- bootstraps(af.train, times = 10, strata = logRR)
cv_fold <- vfold_cv(af.train, v = 10) # repeats = 10
spatial_cv_fold <- spatial_clustering_cv(af.train, coords = c("Longitude", "Latitude"), v = 10)

Setting global modelling metrics

# Metrics
multi.metric <- metric_set(yardstick::rmse, yardstick::rsq, yardstick::ccc, yardstick::mae)

model.control <- control_stack_grid() # save_pred = TRUE, save_workflow = TRUE.
model.control.linear <- control_resamples(save_pred = TRUE)

Setting (hyper)-parameter tuning grids

# --------------------------------------------------------------------------------------------------------------------
# Random forest model pre-defined fixed grid
rf_fixed_grid <- grid_latin_hypercube(
  min_n(),
  trees(),
  finalize(mtry(), 
           af.train),
  size = 20)

# --------------------------------------------------------------------------------------------------------------------
# XGBoost model pre-defined fixed grid
xgb_fixed_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  trees(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), 
           af.train),
  learn_rate(),
  size = 20)

# --------------------------------------------------------------------------------------------------------------------
# CatBoost model pre-defined fixed grid
catboost_param <- dials::parameters(
        trees(),
        min_n(), # min data in leaf
        tree_depth(range = c(4,10)), # depth
        # In most cases, the optimal depth ranges from 4 to 10. Values in the range from 6 to 10 are recommended. 
        learn_rate() # learning rate
    )

catboost_fixed_grid <- 
    dials::grid_max_entropy(
        catboost_param, 
        size = 20 # set this to a higher number to get better results. I don't want to run this all night, so I set it to 20
    )

# --------------------------------------------------------------------------------------------------------------------
# LightGMB model pre-defined fixed grid
lightgbm_params <- dials::parameters(
# The parameters have sane defaults, but if you have some knowledge 
# of the process you can set upper and lower limits to these parameters.
min_n(), # 2nd important
tree_depth(), # 3rd most important
trees()
)

lightgmb_fixed_grid <- 
    dials::grid_max_entropy(
        lightgbm_params, 
        size = 20 # set this to a higher number to get better results. I don't want to run this all night, so I set it to 20
    )

# --------------------------------------------------------------------------------------------------------------------

head(rf_fixed_grid)
head(xgb_fixed_grid)
head(catboost_fixed_grid)
head(lightgmb_fixed_grid)

STEP 2: Setting model specifications

We are going to set up 10 different regression models, each using thier own algorithm:

  1. normal linear model (lm)
  2. generalized linear model with penalized maximum likelihood (glm)
  3. spline regression model
  4. k nearest neighbour (kNN) model
  5. principal component algorithim model (PCA)
  6. support vector machine (SVM) model
  7. random forest (RF) model
  8. extreme gradient boosting (XGBoost) model
  9. categorical feature boost (CatBoost) model
  10. light gradient boosting machine (lightGMB) model
# --------------------------------------------------------------------------------------------------------------------
lm_model <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm") 

# --------------------------------------------------------------------------------------------------------------------
glm_model <- linear_reg(
  mode = "regression",
  penalty = tune(),
  mixture = tune()
) %>%
  set_engine("glmnet")

# --------------------------------------------------------------------------------------------------------------------
spline_model <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm") 

# --------------------------------------------------------------------------------------------------------------------
knn_model <- nearest_neighbor() %>% 
  set_mode("regression") %>% 
  set_engine("kknn") 

# --------------------------------------------------------------------------------------------------------------------
pca_model <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm") 

# --------------------------------------------------------------------------------------------------------------------
svm_model <- svm_poly() %>%
  set_mode("regression") %>%
  set_engine("kernlab")

# --------------------------------------------------------------------------------------------------------------------
rf_model <- rand_forest(mtry = tune(),
                        min_n = tune(), 
                        trees = tune()
) %>% 
  set_mode("regression") %>% 
  set_engine("ranger", 
             importance = "permutation")

# --------------------------------------------------------------------------------------------------------------------
xgb_model <- parsnip::boost_tree(trees = tune(), 
                        tree_depth = tune(),
                        min_n = tune(),
                        loss_reduction = tune(), # first three: model complexity
                        sample_size = tune(),    # randomness
                        mtry = tune(),           # randomness
                        learn_rate = tune()      # step size
) %>%    
  set_mode("regression") %>% 
  set_engine("xgboost", 
             importance = "permutation") 

# --------------------------------------------------------------------------------------------------------------------
catboost_model <- parsnip::boost_tree(      # Remember to load library(treesnip)
        mode = "regression",
        trees = tune(),
        min_n = tune(),
        learn_rate = tune(),
        tree_depth = tune()
    ) %>%
  set_engine("catboost")

# --------------------------------------------------------------------------------------------------------------------
lightgbm_model <- parsnip::boost_tree(
  mode = "regression",
  trees = tune(),
  min_n = tune(),
  tree_depth = tune(),
) %>%
  set_engine("lightgbm") 

STEP 3: Model recipes - pre-processing or feature engineering

Its always a good idea to consult the help function and read carefully what each pre-processing step does. According to the tidymodels recipe package page a recommended pre-processing outline could look like this:

  1. Impute
  2. Handle factor levels
  3. Individual transformations for skewness and other issues
  4. Discretize (if needed and if you have no other choice)
  5. Create dummy variables
  6. Create interactions
  7. Normalization steps (center, scale, range, etc)
  8. Multivariate transformation (e.g. PCA, spatial sign, etc)

However, it should be emphasised that every individual project are different and comes with its own needs. And this suggested order of steps presented here should work for most problems. Normally one has to adapet the pre-proccessing/feature engenering steps of features depending on what models the data is fed into. Pre-processing alters the data to make our model(s) perform better and make the training process less computational expensive. While many models (e.g. nearest neighbour and SVM) require careful and complex pre-processing to produce accurate predictions. Others, such as XGBoost, does not need that as it is an algorithm robust against highly skewed and/or correlated data. Hence the amount of pre-processing required for XGBoost is minimal. Nevertheless, we can still benefit from some preprocessing. See an overview of the recommendd pre-processing steps for some selected models in the book tidymodels with R.

# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Base recipe for the agroforestry biophysical predictor features, key pre-processing steps 
agrofor_base_recipe <- 
  recipe(formula = logRR ~ ., data = af.train) %>%
  update_role(Site.Type, new_role = "predictor") %>% # alters an existing role in the recipe to variables, or assigns an initial role to variables that do not yet have a declared role.
  update_role(PrName, 
              Out.SubInd,
              Out.SubInd.Code,
              Product,
              Latitude,
              Longitude,
              Tree,
              new_role = "sample ID") %>%
  step_novel(Site.Type, -all_outcomes()) %>% # assign a previously unseen factor level to a new value.
  step_dummy(Site.Type, one_hot = TRUE, naming = partial(dummy_names,sep = "_")) %>% # convert nominal data into one or more numeric binary model terms for the levels of the original data.
  step_zv(all_predictors()) %>% # remove variables that contain only a single value.
  step_nzv(all_numeric_predictors()) # potentially remove variables that are highly sparse and unbalanced.
 
# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Linear model - lm recipe
lm_recipe <- 
  agrofor_base_recipe %>%
  step_impute_mean(all_numeric_predictors()) %>% # substitute missing values of numeric variables by the training set mean of those variables.
  step_corr(all_numeric_predictors()) # potentially remove variables that have large absolute correlations with other variables.
  
   
# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Generalised linear model recipe
glm_recipe <- 
  agrofor_base_recipe %>%
  step_impute_mean(all_numeric_predictors()) %>% # substitute missing values of numeric variables by the training set mean of those variables.
  step_corr(all_numeric_predictors()) # potentially remove variables that have large absolute correlations with other variables.
    

# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Spline model recipe
spline_recipe <-  
  agrofor_base_recipe %>%
  step_impute_mean(all_numeric_predictors()) %>% # substitute missing values of numeric variables by the training set mean of those variables.
  step_corr(all_numeric_predictors()) %>% # potentially remove variables that have large absolute correlations with other variables.
  step_YeoJohnson(all_predictors()) %>% # will transform data using a simple Yeo-Johnson transformation.
  step_bs(all_predictors()) # will create new columns that are basis expansions of variables using B-splines.
 

# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# knn model recipe
knn_recipe <- 
  agrofor_base_recipe %>%
  step_normalize(all_numeric_predictors()) %>% # will normalize numeric data to have a standard deviation of one and a mean of zero.
  step_YeoJohnson(all_predictors()) # will transform data using a simple Yeo-Johnson transformation.

# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# PCA model recipe
pca_recipe <- 
  agrofor_base_recipe %>%
  step_BoxCox(all_numeric_predictors()) %>% # will transform data using a simple Box-Cox transformation.
  step_pca(all_predictors(), num_comp = 5)


# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# SVM model recipe
svm_recipe <- 
  agrofor_base_recipe %>%
  step_impute_knn(all_numeric_predictors()) %>% # substitute missing values of numeric variables by the training set mean of those variables.
  step_corr(all_numeric_predictors()) %>% # potentially remove variables that have large absolute correlations with other variables.
  step_interact(~ all_numeric_predictors():all_numeric_predictors(), skip = FALSE) %>% # will create new columns that are interaction terms between two or more variables.
  step_normalize(all_numeric_predictors()) # normalize numeric data to have a standard deviation of one and a mean of zero.

# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Random forest and XGBoost model recipe
rf_xgboost_recipe <- 
  agrofor_base_recipe %>%
  step_impute_mean(all_numeric_predictors()) # substitute missing values of numeric variables by the training set mean of those variables.


# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Catboost model recipe
catboost_recipe <- 
  agrofor_base_recipe %>%
  step_impute_mean(all_numeric_predictors()) # substitute missing values of numeric variables by the training set mean of those variables.

# ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# LightGMB
lightgmb_recipe <- 
  agrofor_base_recipe %>%
  step_impute_mean(all_numeric_predictors()) # substitute missing values of numeric variables by the training set mean of those variables.

spline_recipe %>% prep() %>% juice() %>% glimpse()

STEP 4: Defining model workflows

lm_wf <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(lm_recipe)

glm_wf <- workflow() %>% 
  add_model(glm_model) %>% 
  add_recipe(glm_recipe)

spline_wf <- workflow() %>% 
  add_model(spline_model) %>% 
  add_recipe(spline_recipe)

knn_wf <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(knn_recipe)

pca_wf <- workflow() %>% 
  add_model(pca_model) %>% 
  add_recipe(pca_recipe)

svm_wf <- workflow() %>% 
  add_model(svm_model) %>% 
  add_recipe(svm_recipe)

rf_wf <- workflow() %>% 
  add_model(rf_model) %>% 
  add_recipe(rf_xgboost_recipe)

xgb_wf <- workflow() %>% 
  add_model(xgb_model) %>% 
  add_recipe(rf_xgboost_recipe)

catboost_wf <- 
    workflows::workflow() %>%
    add_model(catboost_model) %>% 
    add_recipe(catboost_recipe)

lightgmb_wf <-
  workflows::workflow() %>%
  add_model(lightgbm_model) %>% 
  add_recipe(lightgmb_recipe)

# SAVING WORKFLOWS
saveRDS(lm_wf, here::here("TidyMod_OUTPUT","lm_wf.RDS"))
saveRDS(glm_wf, here::here("TidyMod_OUTPUT","glm_wf.RDS"))
saveRDS(spline_wf, here::here("TidyMod_OUTPUT","spline_wf.RDS"))
saveRDS(knn_wf, here::here("TidyMod_OUTPUT","knn_wf.RDS"))
saveRDS(pca_wf, here::here("TidyMod_OUTPUT","pca_wf.RDS"))
saveRDS(svm_wf, here::here("TidyMod_OUTPUT","svm_wf.RDS"))
saveRDS(rf_wf, here::here("TidyMod_OUTPUT","rf_wf.RDS"))
saveRDS(xgb_wf, here::here("TidyMod_OUTPUT","xgb_wf.RDS"))
saveRDS(catboost_wf, here::here("TidyMod_OUTPUT","catboost_wf.RDS"))
saveRDS(lightgmb_wf, here::here("TidyMod_OUTPUT","lightgmb_wf.RDS"))

STEP 5: Model (hyper)-parameter tuning - combining workflow and recipies

Note: This took around 8 hours on Amazon’s AWS EC2 (Amazon Elastic Compute Cloud)1, Instance type “r5.4xlarge” with 16 vCPU and 128 GiB Memomory. Check out On-Demand Pricing for price ranges of Amazon’s AWS EC2 machines.

Using the packages parallelMap, parallelly and parallel to run the (hyper)-parameter tuning in parallel, in order to increase efficiency and processing

# Initializing parallel processing 
parallelStartSocket(cpus = detectCores()) 

##########################################################################
# Spatial k-fold cross validation 
##########################################################################
lm_fit_spatcv <- fit_resamples(
  lm_wf,
  resamples = spatial_cv_fold,
  control = model.control,
  metrics = multi.metric)

glm_fit_spatcv <- tune_grid(
  glm_wf,
  resamples = spatial_cv_fold,
  control = model.control,
  metrics = multi.metric)

spline_fit_spatcv <- fit_resamples(
  spline_wf,
  resamples = spatial_cv_fold,
  metrics = multi.metric,
  control = model.control)

knn_fit_spatcv <- fit_resamples(
  knn_wf,
  resamples = spatial_cv_fold,
  metrics = multi.metric,
  control = model.control)

pca_fit_spatcv <- fit_resamples(
  pca_wf,
  resamples = spatial_cv_fold,
  metrics = multi.metric,
  control = model.control)

svm_fit_spatcv <- fit_resamples(
  svm_wf,
  resamples = spatial_cv_fold,
  metrics = multi.metric,
  control = model.control)

rf_param_fit_spatcv <- tune_grid(
  rf_wf,
  resamples = spatial_cv_fold,
  grid = rf_fixed_grid,
  metrics = multi.metric,
  control = model.control)

xgboost_param_fit_spatcv <- tune_grid(
  xgb_wf,
  resamples = spatial_cv_fold,
  grid = xgb_fixed_grid,
  metrics = multi.metric,
  control = model.control)

catboost_param_fit_spatcv <- tune::tune_grid(
  object= catboost_wf,
  resamples = spatial_cv_fold,
  grid = catboost_fixed_grid,
  metrics = multi.metric,
  control = model.control)

lightgmb_param_fit_spatcv <- tune::tune_grid(
  object= lightgmb_wf,
  resamples = spatial_cv_fold,
  grid = lightgmb_fixed_grid,
  metrics = multi.metric,
  control = model.control)

##########################################################################
# Normal (random) k-fold cross validation (CV-fold)  
##########################################################################

lm_fit_cv <- fit_resamples(
   lm_wf,
   resamples = cv_fold,
   control = model.control,
   metrics = multi.metric)

glm_fit_cv <- tune_grid(
   glm_wf,
   resamples = cv_fold,
   control = model.control,
   metrics = multi.metric)

 spline_fit_cv <- fit_resamples(
   spline_wf,
   resamples = cv_fold,
   metrics = multi.metric,
   control = model.control)

 knn_fit_cv <- fit_resamples(
   knn_wf,
   resamples = cv_fold,
   metrics = multi.metric,
   control = model.control)

 pca_fit_cv <- fit_resamples(
   pca_wf,
   resamples = cv_fold,
   metrics = multi.metric,
   control = model.control)

 svm_fit_cv <- fit_resamples(
  svm_wf,
  resamples = cv_fold,
  metrics = multi.metric,
  control = model.control)

 rf_param_fit_cv <- tune_grid(
   rf_wf,
   resamples = cv_fold,
   grid = rf_fixed_grid,
   metrics = multi.metric,
   control = model.control)

 xgboost_param_fit_cv <- tune_grid(
   xgb_wf,
   resamples = cv_fold,
   grid = xgb_fixed_grid,
   metrics = multi.metric,
   control = model.control)

 catboost_param_fit_cv <- tune::tune_grid(
   catboost_wf,
   resamples = cv_fold,
   grid = catboost_fixed_grid,
   metrics = multi.metric,
   control = model.control)
 
 lightgmb_param_fit_cv <- tune::tune_grid(
  object= lightgmb_wf,
  resamples = cv_fold,
  grid = lightgmb_fixed_grid,
  metrics = multi.metric,
  control = model.control)

# Terminating parallel session
parallelStop()

Saving model fittings

##########################################################################
# Saving models made with spatial k-fold cross validation

#lm 
saveRDS(lm_fit_spatcv, here::here("TidyMod_OUTPUT","lm_fit_spatcv.RDS"))

# glm
saveRDS(glm_fit_spatcv, here::here("TidyMod_OUTPUT","glm_fit_spatcv.RDS"))

# spline
saveRDS(spline_fit_spatcv, here::here("TidyMod_OUTPUT","spline_fit_spatcv.RDS"))

# knn
saveRDS(knn_fit_spatcv, here::here("TidyMod_OUTPUT","knn_fit_spatcv.RDS"))

# pca
saveRDS(pca_fit_spatcv, here::here("TidyMod_OUTPUT","pca_fit_spatcv.RDS"))

# svm
saveRDS(svm_fit_spatcv, here::here("TidyMod_OUTPUT","svm_fit_spatcv.RDS"))

# rf_param_fit_spatcv
saveRDS(rf_param_fit_spatcv, here::here("TidyMod_OUTPUT","rf_param_fit_spatcv.RDS"))

# xgboost_param_fit_spatcv
saveRDS(xgboost_param_fit_spatcv, here::here("TidyMod_OUTPUT","xgboost_param_fit_spatcv.RDS"))

# catboost_res_param_fit
saveRDS(catboost_param_fit_spatcv, here::here("TidyMod_OUTPUT","catboost_param_fit_spatcv.RDS"))

# lightgmb_param_fit_spatcv
saveRDS(lightgmb_param_fit_spatcv, here::here("TidyMod_OUTPUT","lightgmb_param_fit_spatcv.RDS"))

##########################################################################
# Saving models made with normal (random) k-fold cross validation (CV-fold)

#lm 
saveRDS(lm_fit_cv, here::here("TidyMod_OUTPUT","lm_fit_cv.RDS"))

# glm
saveRDS(glm_fit_cv, here::here("TidyMod_OUTPUT","glm_fit_cv.RDS"))

# spline
saveRDS(spline_fit_cv, here::here("TidyMod_OUTPUT","spline_fit_cv.RDS"))

# knn
saveRDS(knn_fit_cv, here::here("TidyMod_OUTPUT","knn_fit_cv.RDS"))

# pca
saveRDS(pca_fit_cv, here::here("TidyMod_OUTPUT","pca_fit_cv.RDS"))

# svm
saveRDS(svm_fit_cv, here::here("TidyMod_OUTPUT","svm_fit_cv.RDS"))

# rf_resample_param_fit_cv
saveRDS(rf_param_fit_cv, here::here("TidyMod_OUTPUT","rf_param_fit_cv.RDS"))

# xgboost_res_param_fit_cv
saveRDS(xgboost_param_fit_cv, here::here("TidyMod_OUTPUT","xgboost_param_fit_cv.RDS"))

# catboost_res_param_fit_cv
saveRDS(catboost_param_fit_cv, here::here("TidyMod_OUTPUT","catboost_param_fit_cv.RDS"))

# lightgmb_param_fit_cv
saveRDS(lightgmb_param_fit_cv, here::here("TidyMod_OUTPUT","lightgmb_param_fit_cv.RDS"))

Generating (hyper)-parameter tuning combinations from the tuning of the tree-based models

##########################################################################
# Spatial k-fold cross validation

# Random forest
rf_param_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_param_fit_spatcv.rds"))

rf_hyper_param_plot_spatcv <- 
  rf_param_fit_spatcv %>%
  collect_metrics() %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::select(mean, mtry:min_n) %>%
  pivot_longer(mtry:min_n,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  geom_line(size = 0.8, alpha = 0.5, linetype = 8) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rmse") +
  ggtitle("rf_spatcv hyper-parameters") +
  theme_lucid(plot.title.size = 20)

# XGBoost
xgboost_param_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "xgboost_param_fit_spatcv.rds"))

xgboost_hyper_param_plot_spatcv <- 
  xgboost_param_fit_spatcv %>%
  collect_metrics() %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  geom_line(size = 0.8, alpha = 0.5, linetype = 8) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rmse") +
  ggtitle("xgboost_spatcv hyper-parameters") +
  theme_lucid(plot.title.size = 20)

# CatBoost
catboost_param_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "catboost_param_fit_spatcv.rds"))

catboost_hyper_param_plot_spatcv <- 
  catboost_param_fit_spatcv %>%
  collect_metrics() %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::select(mean, trees:learn_rate) %>%
  pivot_longer(trees:learn_rate,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  geom_line(size = 0.8, alpha = 0.5, linetype = 8) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rmse") +
  ggtitle("catboost_spatcv hyper-parameters") +
  theme_lucid(plot.title.size = 20)

# LightGMB
lightgmb_param_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "lightgmb_param_fit_spatcv.rds"))

lightgmb_param_plot_spatcv <- 
  lightgmb_param_fit_spatcv %>%
  collect_metrics() %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::select(mean, trees:tree_depth) %>%
  pivot_longer(trees:tree_depth,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  geom_line(size = 0.8, alpha = 0.5, linetype = 8) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rmse") +
  ggtitle("lightgmb_spatcv hyper-parameters") +
  theme_lucid(plot.title.size = 20)

##########################################################################
# Normal (random) k-fold cross validation (CV-fold)

# Random forest
rf_param_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_param_fit_cv.rds"))

rf_hyper_param_plot_cv <- 
  rf_param_fit_cv %>%
  collect_metrics() %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::select(mean, mtry:min_n) %>%
  pivot_longer(mtry:min_n,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_line(size = 0.8, alpha = 0.5, linetype = 8) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rmse") +
  ggtitle("rf_cv hyper-parameters") +
  theme_lucid(plot.title.size = 20)

# XGBoost
xgboost_param_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "xgboost_param_fit_cv.rds"))

xgboost_hyper_param_plot_cv <- 
  xgboost_param_fit_cv %>%
  collect_metrics() %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::select(mean, mtry:sample_size) %>%
  pivot_longer(mtry:sample_size,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_line(size = 0.8, alpha = 0.5, linetype = 8) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rmse") +
  ggtitle("xgboost_cv hyper-parameters") +
  theme_lucid(plot.title.size = 20)

# CatBoost
catboost_param_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "catboost_param_fit_cv.rds"))

catboost_hyper_param_plot_cv <- 
  catboost_param_fit_cv %>%
  collect_metrics() %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::select(mean, trees:learn_rate) %>%
  pivot_longer(trees:learn_rate,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  geom_line(size = 0.8, alpha = 0.5, linetype = 8) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rmse") +
  ggtitle("catboost_cv hyper-parameters") +
  theme_lucid(plot.title.size = 20)

lightgmb_param_plot_cv <- 
  lightgmb_param_fit_cv %>%
  collect_metrics() %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::select(mean, trees:tree_depth) %>%
  pivot_longer(trees:tree_depth,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  geom_line(size = 0.8, alpha = 0.5, linetype = 8) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rmse") +
  ggtitle("lightgmb_cv hyper-parameters") +
  theme_lucid(plot.title.size = 20)

# -------------------------------------------------------------------------------------------------------
# Saving the plots as RDS

# Spatial k-fold cross validation
saveRDS(rf_hyper_param_plot_spatcv, here::here("TidyMod_OUTPUT","rf_hyper_param_plot_spatcv.RDS"))
saveRDS(xgboost_hyper_param_plot_spatcv, here::here("TidyMod_OUTPUT","xgboost_hyper_param_plot_spatcv.RDS"))
saveRDS(catboost_hyper_param_plot_spatcv, here::here("TidyMod_OUTPUT","catboost_hyper_param_plot_spatcv.RDS"))
saveRDS(lightgmb_param_plot_spatcv, here::here("TidyMod_OUTPUT","lightgmb_param_plot_spatcv.RDS"))

# Normal (random) k-fold cross validation (CV-fold)
saveRDS(rf_hyper_param_plot_cv, here::here("TidyMod_OUTPUT","rf_hyper_param_plot_cv.RDS"))
saveRDS(xgboost_hyper_param_plot_cv, here::here("TidyMod_OUTPUT","xgboost_hyper_param_plot_cv.RDS"))
saveRDS(catboost_hyper_param_plot_cv, here::here("TidyMod_OUTPUT","catboost_hyper_param_plot_cv.RDS"))
saveRDS(lightgmb_param_plot_cv, here::here("TidyMod_OUTPUT","lightgmb_param_plot_cv.RDS"))

Visualizing (hyper)-parameter tuning combinations of tree-based models

Random forest

Show code
rf_hyper_param_plot_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_hyper_param_plot_spatcv.RDS"))
rf_hyper_param_plot_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_hyper_param_plot_cv.RDS"))

rf_hyper_param_plot <- 
  gridExtra::grid.arrange(rf_hyper_param_plot_spatcv + theme(legend.position="none"), 
                          rf_hyper_param_plot_cv + theme(legend.position="none"), 
                          ncol = 2)
Random forest model (hyper)-parameter tuning combinations - metric: RMSE

Figure 7: Random forest model (hyper)-parameter tuning combinations - metric: RMSE

XGBoost

Show code
xgboost_hyper_param_plot_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "xgboost_hyper_param_plot_spatcv.RDS"))
xgboost_hyper_param_plot_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "xgboost_hyper_param_plot_cv.RDS"))

xgb_hyper_param_plot <- 
  gridExtra::grid.arrange(xgboost_hyper_param_plot_spatcv + theme(legend.position="none"), 
                          xgboost_hyper_param_plot_cv + theme(legend.position="none"), 
                          ncol = 2)
XGBoost model (hyper)-parameter tuning combinations - metric: RMSE

Figure 8: XGBoost model (hyper)-parameter tuning combinations - metric: RMSE

CatBoost

Show code
catboost_hyper_param_plot_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "catboost_hyper_param_plot_spatcv.RDS"))
catboost_hyper_param_plot_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "catboost_hyper_param_plot_cv.RDS"))

catboost_hyper_param_plot <- 
  gridExtra::grid.arrange(catboost_hyper_param_plot_spatcv + theme(legend.position="none"), 
                          catboost_hyper_param_plot_cv + theme(legend.position="none"), 
                          ncol = 2)
CatBoost model (hyper)-parameter tuning combinations - metric: RMSE

Figure 9: CatBoost model (hyper)-parameter tuning combinations - metric: RMSE

LightGMB

Show code
lightgmb_hyper_param_plot_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "lightgmb_param_plot_spatcv.RDS"))
lightgmb_hyper_param_plot_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "lightgmb_param_plot_cv.RDS"))

lightgmb_hyper_param_plot <- 
  gridExtra::grid.arrange(lightgmb_hyper_param_plot_spatcv + theme(legend.position="none"), 
                          lightgmb_hyper_param_plot_cv + theme(legend.position="none"), 
                          ncol = 2)
LightGMB model (hyper)-parameter tuning combinations - metric: RMSE

Figure 10: LightGMB model (hyper)-parameter tuning combinations - metric: RMSE

STEP 6: Evaluating model performances

Model results of all models prior to selection of best (hyper)-parameter combinations

################################################################################
# Spatial k-fold cross validation

# ------------------------------------------------------------------------------
# Loading the models from previous model run

lm_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "lm_fit_spatcv.RDS"))
glm_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "glm_fit_spatcv.rds"))
spline_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "spline_fit_spatcv.rds"))
knn_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "knn_fit_spatcv.rds"))
pca_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "pca_fit_spatcv.rds"))
svm_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "svm_fit_spatcv.rds"))
rf_param_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_param_fit_spatcv.rds"))
xgboost_param_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "xgboost_param_fit_spatcv.rds"))
catboost_param_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "catboost_param_fit_spatcv.rds"))
lightgmb_param_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "lightgmb_param_fit_spatcv.rds"))

# Create a tibble with all model results ---------------------------------------
model_results_spatcv <- tibble(model = list(lm_fit_spatcv, 
                                        glm_fit_spatcv, 
                                        spline_fit_spatcv,
                                        knn_fit_spatcv,
                                        pca_fit_spatcv,
                                        svm_fit_spatcv,
                                        
                                        rf_param_fit_spatcv,
                                        xgboost_param_fit_spatcv,
                                        catboost_param_fit_spatcv,
                                        lightgmb_param_fit_spatcv),
                           model_name = c("lm", "glm", "spline", "knn", "pca", "svm", "RF", "XGB", "CatB", "liGMB"))

# Create a helper function for collecting model metrics ------------------------

map_collect_metrics <- function(model){
  
  model %>% 
    select(id, .metrics) %>% 
    unnest()
}

# Applying the helper function and extracting the model metrics ----------------

model_results_spatcv <- model_results_spatcv %>% 
  mutate(res = map(model, map_collect_metrics)) %>% 
  select(model_name, res) %>% 
  unnest(res)

# SAVING --------------------------------------------------------------------------------
# Spatial k-fold cross validation
saveRDS(model_results_spatcv, here::here("TidyMod_OUTPUT","model_results_spatcv.RDS"))

################################################################################
# Normal (random) k-fold cross validation (CV-fold)

# ------------------------------------------------------------------------------
# Loading the models from previous model run

lm_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "lm_fit_cv.rds"))
glm_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "glm_fit_cv.rds"))
spline_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "spline_fit_cv.rds"))
knn_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "knn_fit_cv.rds"))
pca_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "pca_fit_cv.rds"))
svm_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "svm_fit_cv.rds"))
rf_param_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_param_fit_cv.rds"))
xgboost_param_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "xgboost_param_fit_cv.rds"))
catboost_param_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "catboost_param_fit_cv.rds"))
lightgmb_param_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "lightgmb_param_fit_cv.rds"))

# Create a tibble with all model results ---------------------------------------
model_results_cv <- tibble(model = list(lm_fit_cv, 
                                        glm_fit_cv, 
                                        spline_fit_cv,
                                        knn_fit_cv,
                                        pca_fit_cv,
                                        svm_fit_cv,
                                        
                                        rf_param_fit_cv,
                                        xgboost_param_fit_cv,
                                        catboost_param_fit_cv,
                                        lightgmb_param_fit_cv),
                           model_name = c("lm", "glm", "spline", "knn", "pca", "svm", "RF", "XGB", "CatB", "liGMB"))

# Create a helper function for collecting model metrics ------------------------

map_collect_metrics <- function(model){
  
  model %>% 
    select(id, .metrics) %>% 
    unnest()
}

# Applying the helper function and extracting the model metrics ----------------

model_results_cv <- model_results_cv %>% 
  mutate(res = map(model, map_collect_metrics)) %>% 
  select(model_name, res) %>% 
  unnest(res)

# SAVING --------------------------------------------------------------------------------
# Normal (random) k-fold cross validation (CV-fold)
saveRDS(model_results_cv, here::here("TidyMod_OUTPUT","model_results_cv.RDS"))

Visualising model performances with boxplots

Defining a custom colours range

Show code
nb.cols.10 <- 10
af.ml.colors.10 <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols.10)

nb.cols.9 <- 9
af.ml.colors.9 <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols.9)

Models tuned/fit with spatial clustiering cv-folds

Show code
################################################################################
# Spatial k-fold cross validation
# ------------------------------------------------------------------------------
# Loading the models 
model_results_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","model_results_spatcv.RDS"))

model_results_all_plot_spatcv <- model_results_spatcv %>%
  dplyr::filter(.estimate < 2) %>%                  # <- Its necessary to filter the metric estimates!
  ggplot(aes(x = model_name, y = .estimate, color = model_name)) +
  geom_boxplot(fill = "white", alpha = 0.8, size = 2) +
  geom_jitter(width = 0.2, height = 0.2) +
  scale_color_manual(values = af.ml.colors.10) +
  facet_wrap(~.metric, scales = "free_y") +
  see::theme_lucid(axis.text.angle = 45, plot.title.size = 20) +
  ggtitle("Evaluation of models tuned/fit with spatial clustering cv-folds - all metrics")

# Plotting
model_results_all_plot_spatcv
Model evaluation, resampling: Spatial clustering cv-folds - metrics: CCC, RMSE, MAE, rsq

Figure 11: Model evaluation, resampling: Spatial clustering cv-folds - metrics: CCC, RMSE, MAE, rsq

Models tuned/fit with normal cv-folds

Show code
################################################################################
# Normal (random) k-fold cross validation (CV-fold)
# ------------------------------------------------------------------------------
# Loading the models 
model_results_cv <- readRDS(file = here::here("TidyMod_OUTPUT","model_results_cv.RDS"))

model_results_all_plot_cv <- model_results_cv %>%
  dplyr::filter(.estimate < 2) %>%                  # <- Its necessary to filter the metric estimates!
  ggplot(aes(x = model_name, y = .estimate, color = model_name)) +
  geom_boxplot(fill = "white", alpha = 0.8, size = 2) +
  geom_jitter(width = 0.2, height = 0.2) +
  scale_color_manual(values = af.ml.colors.10) +
  facet_wrap(~.metric, scales = "free_y") +
  see::theme_lucid(axis.text.angle = 45, plot.title.size = 20) +
  ggtitle("Evaluation of models tuned/fit with normal cv-folds - all metrics")

# ------------------------------------------------------------------------------
# Plotting
model_results_all_plot_cv
Model evaluation, resampling: Normal cv-folds - metrics: CCC, RMSE, MAE, rsq

Figure 12: Model evaluation, resampling: Normal cv-folds - metrics: CCC, RMSE, MAE, rsq

We see that the simpler models, e.g. lm and spline perform relatively good compared to the more complex (tree-based models). However, because we did not select best performing (hyper)-parameters yet we see that the tree-based models have instances that do perform much better.

Variability in folds for the different models

Show code
################################################################################
# Spatial k-fold cross validation
model_evaluation_fold_variability_plot_spatcv <- 
  model_results_spatcv %>% 
  dplyr::filter(.estimate < 5) %>% 
  ggplot(aes(x = model_name, y = .estimate, color = id, group = id)) + 
  geom_line(size = 1.5, alpha = 0.8) + 
  facet_wrap(~.metric, scales = "free_y") +
  scale_color_manual(values = af.ml.colors.10) +
  see::theme_lucid(axis.text.angle = 45, plot.title.size = 20) +
  ggtitle("Variability in metrics between each cv-fold for all tested models, resampling: Spatial clustering cv-folds")

model_evaluation_fold_variability_plot_spatcv
Variability in metrics between each cv-fold for all tested models, resampling: Spatial clustering cv-folds - metrics: CCC, RMSE, MAE, rsq

Figure 13: Variability in metrics between each cv-fold for all tested models, resampling: Spatial clustering cv-folds - metrics: CCC, RMSE, MAE, rsq

Show code
################################################################################
# Normal (random) k-fold cross validation (CV-fold)
model_evaluation_fold_variability_plot_cv <- 
  model_results_cv %>% 
  dplyr::filter(.estimate < 5) %>% 
  ggplot(aes(x = model_name, y = .estimate, color = id, group = id)) + 
  geom_line(size = 1.5, alpha = 0.8) + 
  facet_wrap(~.metric, scales = "free_y") +
  scale_color_manual(values = af.ml.colors.10) +
  see::theme_lucid(axis.text.angle = 45, plot.title.size = 20) +
  ggtitle("Variability in metrics between each cv-fold for all tested models, resampling: Normal cv-folds")

model_evaluation_fold_variability_plot_cv
Variability in metrics between each cv-fold for all tested models, resampling: Normal cv-folds - metrics: CCC, RMSE, MAE, rsq

Figure 14: Variability in metrics between each cv-fold for all tested models, resampling: Normal cv-folds - metrics: CCC, RMSE, MAE, rsq

We see a large variability in some of the models, especially the spline model - hence it might perform well in general, but there are large differences between the folds, that we have to be aware of.

Variability in metric suitability for the different model comparisons

Show code
################################################################################
# Spatial k-fold cross validation
model_metric_suitability_density_plot_spatcv <- 
  model_results_spatcv %>% 
  dplyr::filter(.estimate < 5) %>% 
  ggplot(aes(x = .estimate, color = model_name, fill = model_name)) + 
  geom_density(alpha = 0.1, size = 1) + 
  facet_wrap(~.metric, scales = "free") +
  scale_color_manual(values = af.ml.colors.10) +
  scale_fill_manual(values = af.ml.colors.10) +
  see::theme_lucid(plot.title.size = 20) +
  ggtitle("Metric suitability density plot for all tested models, resampling: Spatial clustering cv-folds") +
  xlab("Estimated logRR values") +
  ylab("Density")

model_metric_suitability_density_plot_spatcv
Density plot of metric suitability for all tested models, resampling: Spatial clustering cv-folds - metrics: CCC, RMSE, MAE, rsq

Figure 15: Density plot of metric suitability for all tested models, resampling: Spatial clustering cv-folds - metrics: CCC, RMSE, MAE, rsq

It seems like the spline and kNN models is outperforming many of the other models based on the RMSE metric - as these models have the lowest value for RMSE. However, their spread is not very good. They seem to both be generally over-estimating logRR values, compared to the other models. This is why it is extremely powerful to have more than one metric when comparing these machine learning modelling algorithms. Because one model might outperform other models in one aspect but not on other, more generally important aspects. Keep this in mind!

Show code
################################################################################
# Normal (random) k-fold cross validation (CV-fold)
model_metric_suitability_density_plot_cv <- 
  model_results_cv %>% 
  dplyr::filter(.estimate < 5) %>%
  ggplot(aes(x = .estimate, color = model_name, fill = model_name)) + 
  geom_density(alpha = 0.1, size = 1) + 
  facet_wrap(~.metric, scales = "free") +
  scale_color_manual(values = af.ml.colors.10) +
  scale_fill_manual(values = af.ml.colors.10) +
  see::theme_lucid(plot.title.size = 20) +
  ggtitle("Metric suitability density plot for all tested models, resampling: Normal cv-folds") +
  xlab("Estimated logRR values") +
  ylab("Density")

model_metric_suitability_density_plot_cv
Density plot of metric suitability for all tested models, resampling: Normal cv-folds - metrics: CCC, RMSE, MAE, rsq

Figure 16: Density plot of metric suitability for all tested models, resampling: Normal cv-folds - metrics: CCC, RMSE, MAE, rsq

There are large differences in the estimate density distributions for the different models. Yet, the RMSE seems to be a reasonable metric. However, there seem to be no metric that really suits our data. We can decide to continue to make use of RMSE and MAE as these are generally used for regression problems.

Note: Pay attention to how different the desity distribution is for all metrics when comparing the two different resampling techniques, spatial clustering cv-folds and the normal cv-folds.

If we recall again the logRR distribution, that is a little positively skewed, it is easier to interpret the metric suitability density plots. Optimally we are interested in metrics that cover the whole “real” distribution of logRR.

Show code
logRR_distribution <- agrofor.biophys.modelling.data %>%
  rationalize(logRR) %>%
  drop_na(logRR) %>%
  ggplot(aes(x = logRR)) +
  geom_density(aes(y = ..count..), fill = "lightgray") +
  geom_vline(aes(xintercept = mean(logRR)), 
             linetype = "dashed", 
             size = 1, color = "#FC4E07") +
  scale_x_continuous(name="Observed logRR values", limits=c(-2.5, 2.5)) +
  ggtitle("Density/count distribution of real logRR values for the ERA agroforestry data") +
  ylab("Count (number of oblersavions)") +
  see::theme_lucid()

logRR_distribution
Density/count distribution of real logRR values

Figure 17: Density/count distribution of real logRR values

Show code
#Warning: Removed 78 rows containing non-finite values (stat_density).

Summerising model metric performances

Show code
################################################################################
# Spatial k-fold cross validation
model_results_spatcv %>% 
  dplyr::filter(.metric == "rmse") %>%
  mutate_if(sapply(model_results_spatcv, is.character), as.factor) %>%
  group_by(model_name, .metric) %>% 
  summarise(mean = mean(.estimate)) %>%
  arrange(desc(-mean)) %>%
  ggplot() + 
  geom_bar(aes(x = reorder(model_name, mean), y = mean), stat = 'identity') +
  ggtitle("Mean RMSE metric for all tested models, resampling: Normal cv-folds") +
  ylab("Mean RMSE") +
  xlab("Model") +
  see::theme_lucid() 
Summarising model metric performances, resampling: Spatial clustering cv-folds - metric: RMSE

Figure 18: Summarising model metric performances, resampling: Spatial clustering cv-folds - metric: RMSE

Show code
################################################################################
# Normal (random) k-fold cross validation (CV-fold)
model_results_cv %>% 
  dplyr::filter(.metric == "rmse") %>%
  mutate_if(sapply(model_results_cv, is.character), as.factor) %>%
  group_by(model_name, .metric) %>% 
  summarise(mean = mean(.estimate)) %>%
  arrange(desc(-mean)) %>%
  ggplot() + 
  geom_bar(aes(x = reorder(model_name, mean), y = mean), stat = 'identity') +
  ggtitle("Mean RMSE metric for all tested models, resampling: Normal cv-folds") +
  ylab("Mean RMSE") +
  xlab("Model") +
  see::theme_lucid() 
Summarising model metric performances, resampling: Normal cv-folds - metric: RMSE

Figure 19: Summarising model metric performances, resampling: Normal cv-folds - metric: RMSE

Even though the spline model(s) does seem to be superior on all the metrics, we know that its “accuracy” is relatively weak, as we saw the variations between the folds and its density distributions. Hence, the subsequent best models based on best performing (hyper)-parameter combinations are probably more appropriate to use.

Using tidyposterior for model comparisons

We can use the package tidyposterior, part of the tidymodels meta-package, to get a better insight into how the individual models perform post-hoc using Bayesian generalised linear models. cThis is useful when one wants to conduct a post-hoc analyses of resampling results generated by the various models. For example, if two models are evaluated using the root mean squared error (RMSE) metric on our 10-fold cross-validation, there are 10 paired statistics (one for each fold). These can be used to make comparisons between models without involving a test set. Hence, we get a picture on the variation in a specific metric for the models across the different folds.

First, calculating the posterior distributions

The tidyposterior package uses the Stan software for specifying and fitting the models via the rstanarm package.

# library(tidyposterior)

################################################################################
# Spatial k-fold cross validation

model_posterior_spatcv <- 
  model_results_spatcv %>% 
  filter(.metric == "rmse") %>% 
  select(model_name, id, .estimate) %>% 
  pivot_wider(names_from = "model_name", values_from = ".estimate", values_fn = mean)

rmse_model_spatcv <- tidyposterior::perf_mod(model_posterior_spatcv, seed = 412)
# ----------SAVING ----------------- ----------------- ----------------- -----------------
saveRDS(rmse_model_spatcv, file = here::here("TidyMod_OUTPUT", "rmse_model_spatcv.RDS"))
# library(tidyposterior)

################################################################################
# Normal (random) k-fold cross validation (CV-fold)
model_posterior_cv <- 
  model_results_cv %>% 
  filter(.metric == "rmse") %>% 
  select(model_name, id, .estimate) %>% 
  pivot_wider(names_from = "model_name", values_from = ".estimate", values_fn = mean) 

rmse_model_cv <- tidyposterior::perf_mod(model_posterior_cv, seed = 421)
# ----------SAVING ----------------- ----------------- ----------------- -----------------
saveRDS(rmse_model_cv, file = here::here("TidyMod_OUTPUT", "rmse_model_cv.RDS"))
################################################################################
# Spatial k-fold cross validation

# Reading data
rmse_model_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "rmse_model_spatcv.RDS"))

rmse_model_spatcv %>% tidy() %>% group_by(model) %>% summarise(mean = mean(posterior))
# A tibble: 10 × 2
   model   mean
   <chr>  <dbl>
 1 CatB   0.684
 2 glm    0.865
 3 knn    0.899
 4 liGMB  0.685
 5 lm     0.907
 6 pca    0.722
 7 RF     0.736
 8 spline 4.44 
 9 svm    1.31 
10 XGB    0.706
################################################################################
# Normal (random) k-fold cross validation (CV-fold)

# Reading data
rmse_model_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "rmse_model_cv.RDS"))

rmse_model_cv %>% tidy() %>% group_by(model) %>% summarise(mean = mean(posterior))
# A tibble: 10 × 2
   model   mean
   <chr>  <dbl>
 1 CatB   0.710
 2 glm    0.690
 3 knn    0.953
 4 liGMB  0.593
 5 lm     0.682
 6 pca    0.712
 7 RF     0.590
 8 spline 0.610
 9 svm    0.605
10 XGB    0.716

The posterior distribution is useful to determine if there is in fact a difference between the models (for the metric used, e.g. RMSE). The posterior distribution shows the distribution of the difference in metrics between the models. Whether it is higher or lower depends on the metric one choose to use (RMSE, ROC_AUC, etc.). If the distribution’s central tendency is not 0 then there is a difference between the models.

Visualising the posterior distribution with ggplot

We can now visualize the posterior distribution of the models by plotting it with ggplot

Show code
################################################################################
# Spatial k-fold cross validation
post_plot_rmse_model_spatcv <- 
  rmse_model_spatcv %>% tidy() %>%
  ggplot(aes(x = model, y = posterior, color = model)) +
  geom_violin() +
  geom_quasirandom(dodge.width = 0.3, varwidth = FALSE, width = 0.08, alpha = 0.05) +
  stat_summary(fun.y = "median", geom = "point", size = 2, color = "black") +
  stat_summary(fun.y = "median", geom = "line", aes(group = 1), size = 1.1, color = "black") +
  scale_color_manual(values = af.ml.colors.10) +
  ggtitle("Posterior distributions for all tested models, resampling: Spatial clustering cv-folds - metric: RMSE") +
  ylab("Mean RMSE") +
  xlab("Model") +
  see::theme_lucid(axis.text.angle = 45, plot.title.size = 20, legend.title.size = 15, legend.text.size = 15, axis.text.size = 15, tags.size = 15, legend.position = "bottom")

post_plot_rmse_model_spatcv
Posterior distributions for all tested models, resampling: Spatial clustering cv-folds - metric: RMSE

Figure 20: Posterior distributions for all tested models, resampling: Spatial clustering cv-folds - metric: RMSE

Show code
################################################################################
# Normal (random) k-fold cross validation (CV-fold)
post_plot_rmse_model_cv <-
  rmse_model_cv %>% tidy() %>%
  ggplot(aes(x = model, y = posterior, color = model)) +
  geom_violin() +
  geom_quasirandom(dodge.width = 0.3, varwidth = FALSE, width = 0.08, alpha = 0.05) +
  stat_summary(fun.y = "median", geom = "point", size = 2, color = "black") +
  stat_summary(fun.y = "median", geom = "line", aes(group = 1), size = 1.1, color = "black") +
  scale_color_manual(values = af.ml.colors.10) +
  ggtitle("Posterior distributions for all tested models, resampling: Spatial clustering cv-folds - metric: RMSE") +
  ylab("Mean RMSE") +
  xlab("Model") +
  see::theme_lucid(axis.text.angle = 45, plot.title.size = 20, legend.title.size = 15, legend.text.size = 15, axis.text.size = 15, tags.size = 15, legend.position = "bottom")

post_plot_rmse_model_cv
Posterior distributions for all tested models, resampling: Normal cv-folds - metric: RMSE

Figure 21: Posterior distributions for all tested models, resampling: Normal cv-folds - metric: RMSE

These posterior distribution plots are very interesting because they give us a detailed insight into how the models differ from each other. We also see that there are significant differences between the models whether we use spatial clustering cv-folds or normal cv-folds.

Contrast assessment between the models

Finalluy, we can use the contrast assessment between models to understand the

################################################################################
# Spatial k-fold cross validation
contrast_models(rmse_model_spatcv)
# Posterior samples of performance differences
# A tibble: 180,000 × 4
   difference model_1 model_2 contrast  
        <dbl> <chr>   <chr>   <chr>     
 1     -0.728 lm      glm     lm vs. glm
 2     -0.308 lm      glm     lm vs. glm
 3      0.385 lm      glm     lm vs. glm
 4      1.43  lm      glm     lm vs. glm
 5      0.734 lm      glm     lm vs. glm
 6     -0.378 lm      glm     lm vs. glm
 7      0.469 lm      glm     lm vs. glm
 8     -0.479 lm      glm     lm vs. glm
 9     -0.738 lm      glm     lm vs. glm
10      1.13  lm      glm     lm vs. glm
# … with 179,990 more rows
################################################################################
# Normal (random) k-fold cross validation (CV-fold)
contrast_models(rmse_model_cv) %>% 
  group_by(contrast) %>% 
  summarise(mean = mean(difference)) %>% 
  arrange(desc(mean))
# A tibble: 45 × 2
   contrast        mean
   <chr>          <dbl>
 1 knn vs. RF     0.363
 2 knn vs. liGMB  0.360
 3 knn vs. svm    0.348
 4 knn vs. CatB   0.243
 5 knn vs. pca    0.241
 6 knn vs. XGB    0.237
 7 XGB vs. liGMB  0.124
 8 pca vs. RF     0.121
 9 pca vs. liGMB  0.119
10 CatB vs. liGMB 0.117
# … with 35 more rows

Visually showcase of the model contrast assessment

These plots might be a little difficult to interpret but basically they show how different each model is compared to another and also how much better or worse this model is. The x-axis shows the difference for the model and this is centered around xero. If the difference is skewed towards the right then the difference is positive ant it can be said that the given model performs better than the one it’s compared with. The y-axis shows the actual posterior probability. Let’s take the example of the glm model. First look at the top row, this will tell you how much better/worse the glm model is compared to the CatBoost model. In this case the glm model does not really have a strong right skewed distribution difference, indicating that it is not ‘significantly’ different from the CatBoost model. However, it is slightly skewed towards the right, suggesting that it is better. The models that have a positive (above zero) location have a significant better performance!

With spatial clustering cv-folds

Show code
################################################################################
# Spatial k-fold cross validation
contrast_plot_rmse_spatcv <- contrast_models(rmse_model_spatcv) 
  
autoplot(contrast_plot_rmse_spatcv)
Posterior model contrast assessment, resampling: Spatial clustering cv-folds - metric: RMSE

Figure 22: Posterior model contrast assessment, resampling: Spatial clustering cv-folds - metric: RMSE

With normal cv-folds

This picture changes remarkebly when using normal cv-folds as we see below:

Show code
################################################################################
# Normal (random) k-fold cross validation (CV-fold)
contrast_plot_rmse_cv <- contrast_models(rmse_model_cv) 
  
autoplot(contrast_plot_rmse_cv)
Posterior model contrast assessment, resampling: Normal cv-folds - metric: RMSE

Figure 23: Posterior model contrast assessment, resampling: Normal cv-folds - metric: RMSE

STEP 8: Finalizing models and model workflows

Now we have gained significantly detailed information on all the tested models. Until now, we have only been looking at all (hyper)-parameter fitting/tuning combinations, and we have seen than even within a single model there is enormous differences in the model performance, even across and within our selected metrics. Hence we have not yet selected the best performing (hyper)-paramenter combinations, that result in best performing models. It is now time to do exactly that!

Note: We can update (or “finalize”) our model or workflow objects with settings or values from select_best().

Let us first finalize models and workflows from the non-tree based models as this is a little easier.

Finalizing model workflows for non-tree based models

##########################################################################
# Spatial k-fold cross validation

# Loading models and workflows
#-------------------------------------------------------
lm_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","lm_fit_spatcv.RDS"))
lm_wf <- readRDS(file = here::here("TidyMod_OUTPUT","lm_wf.RDS"))

glm_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","glm_fit_spatcv.RDS"))
glm_wf <- readRDS(file = here::here("TidyMod_OUTPUT","glm_wf.RDS"))

spline_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","spline_fit_spatcv.RDS"))
spline_wf <- readRDS(file = here::here("TidyMod_OUTPUT","spline_wf.RDS"))

knn_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","knn_fit_spatcv.RDS"))
knn_wf <- readRDS(file = here::here("TidyMod_OUTPUT","knn_wf.RDS"))

pca_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","pca_fit_spatcv.RDS"))
pca_wf <- readRDS(file = here::here("TidyMod_OUTPUT","pca_wf.RDS"))

svm_fit_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","svm_fit_spatcv.RDS"))
svm_wf <- readRDS(file = here::here("TidyMod_OUTPUT","svm_wf.RDS"))


# linear (lm) model 
#-------------------------------------------------------
# Best and last model 
lm_best_rmse_spatcv <- lm_fit_spatcv %>%     
  tune::select_best("rmse") 
# Last model workflow
lm_final_wf_rmse_spatcv <- lm_wf %>%
  finalize_workflow(lm_best_rmse_spatcv)


# glm model 
#-------------------------------------------------------
# Best and last model 
glm_best_rmse_spatcv <- glm_fit_spatcv %>%
  tune::select_best("rmse") 
# Last model workflow
glm_final_wf_rmse_spatcv <- glm_wf %>%
  finalize_workflow(glm_best_rmse_spatcv)


# spline model 
#-------------------------------------------------------
# Best and last model 
spline_best_rmse_spatcv <- spline_fit_spatcv %>%
  tune::select_best("rmse") 
# Last model workflow
spline_final_wf_rmse_spatcv <- spline_wf %>%
  finalize_workflow(spline_best_rmse_spatcv)


# knn model 
#-------------------------------------------------------
# Best and last model 
knn_best_rmse_spatcv <- knn_fit_spatcv %>%
  tune::select_best("rmse") 
# Last model workflow
knn_final_wf_rmse_spatcv <- knn_wf %>%
  finalize_workflow(knn_best_rmse_spatcv)


# pca model 
#-------------------------------------------------------
# Best and last model 
pca_best_rmse_spatcv <- pca_fit_spatcv %>%
  tune::select_best("rmse") 
# Last model workflow
pca_final_wf_rmse_spatcv <- pca_wf %>%
  finalize_workflow(pca_best_rmse_spatcv)


# svm model 
#-------------------------------------------------------
# Best and last model 
svm_best_rmse_spatcv <- svm_fit_spatcv %>%
  tune::select_best("rmse") 
# Last model workflow
svm_final_wf_rmse_spatcv <- svm_wf %>%
  finalize_workflow(svm_best_rmse_spatcv)


##########################################################################
# Normal (random) k-fold cross validation (CV-fold)

# Loading models and workflows
#-------------------------------------------------------
lm_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT","lm_fit_cv.RDS"))
#lm_wf <- readRDS(file = here::here("TidyMod_OUTPUT","lm_wf.RDS"))
glm_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT","glm_fit_cv.RDS"))
#glm_wf <- readRDS(file = here::here("TidyMod_OUTPUT","glm_wf.RDS"))
spline_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT","spline_fit_cv.RDS"))
#spline_wf <- readRDS(file = here::here("TidyMod_OUTPUT","spline_wf.RDS"))
knn_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT","knn_fit_cv.RDS"))
#knn_wf <- readRDS(file = here::here("TidyMod_OUTPUT","knn_wf.RDS"))
pca_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT","pca_fit_cv.RDS"))
#pca_wf <- readRDS(file = here::here("TidyMod_OUTPUT","pca_wf.RDS"))
svm_fit_cv <- readRDS(file = here::here("TidyMod_OUTPUT","svm_fit_cv.RDS"))
#svm_wf <- readRDS(file = here::here("TidyMod_OUTPUT","svm_wf.RDS"))


# linear (lm) model 
#-------------------------------------------------------
# Best and last model 
lm_best_rmse_cv <- lm_fit_cv %>%     
  tune::select_best("rmse") 
# Last model workflow
lm_final_wf_rmse_cv <- lm_wf %>%
  finalize_workflow(lm_best_rmse_cv)


# glm model 
#-------------------------------------------------------
# Best and last model 
glm_best_rmse_cv <- glm_fit_cv %>%
  tune::select_best("rmse") 
# Last model workflow
glm_final_wf_rmse_cv <- glm_wf %>%
  finalize_workflow(glm_best_rmse_cv)


# spline model 
#-------------------------------------------------------
# Best and last model 
spline_best_rmse_cv <- spline_fit_cv %>%
  tune::select_best("rmse") 
# Last model workflow
spline_final_wf_rmse_cv <- spline_wf %>%
  finalize_workflow(spline_best_rmse_cv)


# knn model 
#-------------------------------------------------------
# Best and last model 
knn_best_rmse_cv <- knn_fit_cv %>%
  tune::select_best("rmse") 
# Last model workflow
knn_final_wf_rmse_cv <- knn_wf %>%
  finalize_workflow(knn_best_rmse_cv)


# pca model 
#-------------------------------------------------------
# Best and last model 
pca_best_rmse_cv <- pca_fit_cv %>%
  tune::select_best("rmse") 
# Last model workflow
pca_final_wf_rmse_cv <- pca_wf %>%
  finalize_workflow(pca_best_rmse_cv)


# svm model 
#-------------------------------------------------------
# Best and last model 
svm_best_rmse_cv <- svm_fit_cv %>%
  tune::select_best("rmse") 
# Last model workflow
svm_final_wf_rmse_cv <- svm_wf %>%
  finalize_workflow(svm_best_rmse_cv)

Finalizing model workflows for the tree-based models

Setting up the final model workflows with the best model parameters.

################################################################################
# Spatial k-fold cross validation


# Collecting models with best performing hyper-parameter combinations ----------

rf_best_rmse_spatcv <- rf_param_fit_spatcv %>% show_best("rmse", n = 1)
xgboos_best_rmse_spatcv <- xgboost_param_fit_spatcv %>% show_best("rmse", n = 1)
catboost_best_rmse_spatcv <- catboost_param_fit_spatcv %>% show_best("rmse", n = 1)
lightgmb_best_rmse_spatcv <- lightgmb_param_fit_spatcv %>% show_best("rmse", n = 1)

# Final workflows with best performing model (hyper)-parameter specifications --

# Random forest
rf_final_wf_rmse_spatcv <- rf_wf %>% 
    finalize_workflow(select_best(rf_param_fit_spatcv))

# XGBoost
xgboost_final_wf_rmse_spatcv <- xgb_wf %>% 
    finalize_workflow(select_best(xgboost_param_fit_spatcv))

# CatBoost
catboost_final_wf_rmse_spatcv <- catboost_wf %>% 
    finalize_workflow(select_best(catboost_param_fit_spatcv))

# LightGMB
lightgmb_final_wf_rmse_spatcv <- lightgmb_wf %>% 
    finalize_workflow(select_best(lightgmb_param_fit_spatcv))


################################################################################
# Normal (random) k-fold cross validation (CV-fold)

# Collecting models with best performing hyper-parameter combinations ----------
rf_best_rmse_cv <- rf_param_fit_cv %>% show_best("rmse", n = 1)
xgboos_best_rmse_cv <- xgboost_param_fit_cv %>% show_best("rmse", n = 1)
catboost_best_rmse_cv <- catboost_param_fit_cv %>% show_best("rmse", n = 1)
lightgmb_best_rmse_cv <- lightgmb_param_fit_cv %>% show_best("rmse", n = 1)

# Creating final workflows with best performing models -------------------------

# Random forest
rf_final_wf_rmse_cv <- rf_wf %>% 
    finalize_workflow(select_best(rf_param_fit_cv))

# XGBoost
xgboost_final_wf_rmse_cv <- xgb_wf %>% 
    finalize_workflow(select_best(xgboost_param_fit_cv))

# CatBoost
catboost_final_wf_rmse_cv <- catboost_wf %>% 
    finalize_workflow(select_best(catboost_param_fit_cv))

# LightGMB
lightgmb_final_wf_rmse_cv <- lightgmb_wf %>% 
    finalize_workflow(select_best(lightgmb_param_fit_cv))

Finalizing all models

##########################################################################
# Spatial k-fold cross validation

lm_final_model_spatcv <- lm_model %>%
  finalize_model(lm_best_rmse_spatcv)

glm_final_model_spatcv <- glm_model %>%
  finalize_model(glm_best_rmse_spatcv)

spline_final_model_spatcv <- spline_model %>%
  finalize_model(spline_best_rmse_spatcv)

knn_final_model_spatcv <- knn_model %>%
  finalize_model(knn_best_rmse_spatcv)

pca_final_model_spatcv <- pca_model %>%
  finalize_model(pca_best_rmse_spatcv)

svm_final_model_spatcv <- svm_model %>%
  finalize_model(svm_best_rmse_spatcv)

rf_final_model_spatcv <- rf_model %>%
  finalize_model(rf_best_rmse_spatcv)

xgb_final_model_spatcv <- xgb_model %>%
  finalize_model(xgboos_best_rmse_spatcv)

catboost_final_model_spatcv <- catboost_model %>%
  finalize_model(catboost_best_rmse_spatcv)

lightgmb_final_model_spatcv <- lightgbm_model %>%
  finalize_model(lightgmb_best_rmse_spatcv)

##########################################################################
# Normal (random) k-fold cross validation (CV-fold)

lm_final_model_cv <- lm_model %>%
  finalize_model(lm_best_rmse_cv)

glm_final_model_cv <- glm_model %>%
  finalize_model(glm_best_rmse_cv)

spline_final_model_cv <- spline_model %>%
  finalize_model(spline_best_rmse_cv)

knn_final_model_cv <- knn_model %>%
  finalize_model(knn_best_rmse_cv)

pca_final_model_cv <- pca_model %>%
  finalize_model(pca_best_rmse_cv)

svm_final_model_cv <- svm_model %>%
  finalize_model(svm_best_rmse_cv)

rf_final_model_cv <- rf_model %>%
  finalize_model(rf_best_rmse_cv)

xgb_final_model_cv <- xgb_model %>%
  finalize_model(xgboos_best_rmse_cv)

catboost_final_model_cv <- catboost_model %>%
  finalize_model(catboost_best_rmse_cv)

lightgmb_final_model_cv <- lightgbm_model %>%
  finalize_model(lightgmb_best_rmse_cv)

Saving finalized models

STEP 9: Perform last model fitting using all data (training + testing)

set.seed(4567)

################################################################################
# Spatial k-fold cross validation

# lm
last_lm_fit_spatcv_fold <- lm_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

# glm
last_glm_fit_spatcv_fold <- glm_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

# spline
last_spline_fit_spatcv_fold <- spline_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

# knn
last_knn_fit_spatcv_fold <- knn_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

# pca
last_pca_fit_spatcv_fold <- pca_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

# svm
last_svm_fit_spatcv_fold <- svm_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

# Random forest
last_rf_fit_spatcv_fold <- rf_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

# XGBoost
last_xgb_fit_spatcv_fold <- xgboost_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

# CatBoost
last_catboost_fit_spatcv_fold <- catboost_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

# LightGMB
last_lightgmb_fit_spatcv_fold <- lightgmb_final_wf_rmse_spatcv %>% 
  last_fit(af.split)

################################################################################
# Normal (random) k-fold cross validation (CV-fold)

# lm
last_lm_fit_cv_fold <- lm_final_wf_rmse_cv %>% 
  last_fit(af.split)

# glm
last_glm_fit_cv_fold <- glm_final_wf_rmse_cv %>% 
  last_fit(af.split)

# spline
last_spline_fit_cv_fold <- spline_final_wf_rmse_cv %>% 
  last_fit(af.split)

# knn
last_knn_fit_cv_fold <- knn_final_wf_rmse_cv %>% 
  last_fit(af.split)

# pca
last_pca_fit_cv_fold <- pca_final_wf_rmse_cv %>% 
  last_fit(af.split)

# svm
last_svm_fit_cv_fold <- svm_final_wf_rmse_cv %>% 
  last_fit(af.split)

# Random forest
last_rf_fit_cv_fold <- rf_final_wf_rmse_cv %>% 
  last_fit(af.split)

# XGBoost
last_xgb_fit_cv_fold <- xgboost_final_wf_rmse_cv %>% 
  last_fit(af.split)

# CatBoost
last_catboost_fit_cv_fold <- catboost_final_wf_rmse_cv %>% 
  last_fit(af.split)

# LightGMB
last_lightgmb_fit_cv_fold <- lightgmb_final_wf_rmse_cv %>% 
  last_fit(af.split)

Saving last model fittings

##########################################################################
# Saving models made with spatial k-fold cross validation

#lm 
saveRDS(last_lm_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_lm_fit_spatcv_fold.RDS"))

# glm
saveRDS(last_glm_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_glm_fit_spatcv_fold.RDS"))

# spline
saveRDS(last_spline_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_spline_fit_spatcv_fold.RDS"))

# knn
saveRDS(last_knn_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_knn_fit_spatcv_fold.RDS"))

# pca
saveRDS(last_pca_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_pca_fit_spatcv_fold.RDS"))

# svm
saveRDS(last_svm_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_svm_fit_spatcv_fold.RDS"))

# rf_param_fit_spatcv
saveRDS(last_rf_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_rf_fit_spatcv_fold.RDS"))

# xgboost_param_fit_spatcv
saveRDS(last_xgb_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_xgb_fit_spatcv_fold.RDS"))

# catboost_res_param_fit
saveRDS(last_catboost_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_catboost_fit_spatcv_fold.RDS"))

# lightgmb_param_fit_spatcv
saveRDS(last_lightgmb_fit_spatcv_fold, here::here("TidyMod_OUTPUT","last_lightgmb_fit_spatcv_fold.RDS"))

##########################################################################
# Saving models made with normal (random) k-fold cross validation (CV-fold)

#lm 
saveRDS(last_lm_fit_cv_fold, here::here("TidyMod_OUTPUT","last_lm_fit_cv_fold.RDS"))

# glm
saveRDS(last_glm_fit_cv_fold, here::here("TidyMod_OUTPUT","last_glm_fit_cv_fold.RDS"))

# spline
saveRDS(last_spline_fit_cv_fold, here::here("TidyMod_OUTPUT","last_spline_fit_cv_fold.RDS"))

# knn
saveRDS(last_knn_fit_cv_fold, here::here("TidyMod_OUTPUT","last_knn_fit_cv_fold.RDS"))

# pca
saveRDS(last_pca_fit_cv_fold, here::here("TidyMod_OUTPUT","last_pca_fit_cv_fold.RDS"))

# svm
saveRDS(last_svm_fit_cv_fold, here::here("TidyMod_OUTPUT","last_svm_fit_cv_fold.RDS"))

# rf_param_fit_spatcv
saveRDS(last_rf_fit_cv_fold, here::here("TidyMod_OUTPUT","last_rf_fit_cv_fold.RDS"))

# xgboost_param_fit_spatcv
saveRDS(last_xgb_fit_cv_fold, here::here("TidyMod_OUTPUT","last_xgb_fit_cv_fold.RDS"))

# catboost_res_param_fit
saveRDS(last_catboost_fit_cv_fold, here::here("TidyMod_OUTPUT","last_catboost_fit_cv_fold.RDS"))

# lightgmb_param_fit_spatcv
saveRDS(last_lightgmb_fit_cv_fold, here::here("TidyMod_OUTPUT","last_lightgmb_fit_cv_fold.RDS"))

STEP 10: Model validation, evaluation and performance

Cenerating variable importance plots

##############################################################################################################
# Spatial k-fold cross validation

# Loading the last fit models 
#-------------------------------------------------------
last_lm_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_lm_fit_spatcv_fold.RDS"))
last_glm_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_glm_fit_spatcv_fold.RDS"))
last_spline_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_spline_fit_spatcv_fold.RDS"))
last_knn_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_knn_fit_spatcv_fold.RDS"))
last_pca_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_pca_fit_spatcv_fold.RDS"))
last_svm_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_svm_fit_spatcv_fold.RDS"))
last_rf_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_rf_fit_spatcv_fold.RDS"))
last_xgb_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_xgb_fit_spatcv_fold.RDS"))
last_catboost_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_catboost_fit_spatcv_fold.RDS"))
last_lightgmb_fit_spatcv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_lightgmb_fit_spatcv_fold.RDS"))

#-------------------------------------------------------
# lm
vip_lm_spatcv <- last_lm_fit_spatcv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: lm with spatial clustering cv-folds")

#-------------------------------------------------------
# glm
vip_glm_spatcv <- last_glm_fit_spatcv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: glm with spatial clustering cv-folds")

#-------------------------------------------------------
# spline
vip_spline_spatcv <- last_spline_fit_spatcv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: spline with spatial clustering cv-folds")

#-------------------------------------------------------
# knn
# vip_knn_spatcv <- last_knn_fit_spatcv_fold %>% 
#   pluck(".workflow", 1) %>%   
#   pull_workflow_fit() %>% 
#   vip(num_features = 20) +
#   theme_lucid() +
#   ggtitle("Variable importance, model: knn with spatial clustering cv-folds")

# Error: Model-specific variable importance scores are currently not available for this type of model.

#-------------------------------------------------------
# pca
vip_pca_spatcv <- last_pca_fit_spatcv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: pca with spatial clustering cv-folds")

#-------------------------------------------------------
# svm
# vip_svm_spatcv <- last_svm_fit_spatcv_fold %>% 
#   pluck(".workflow", 1) %>%   
#   pull_workflow_fit() %>% 
#   vip(num_features = 20) +
#   theme_lucid() +
#   ggtitle("Variable importance, model: svm with spatial clustering cv-folds")

# Error: Model-specific variable importance scores are currently not available for this type of model.

#-------------------------------------------------------
# Random forest
vip_rf_spatcv <- last_rf_fit_spatcv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: RF with spatial clustering cv-folds")

#-------------------------------------------------------
# XGBoost 
vip_xgboost_spatcv <- last_xgb_fit_spatcv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: XGB with spatial clustering cv-folds")

#-------------------------------------------------------
# CatBoost
# vip_catboost_spatcv <- last_catboost_fit_spatcv_fold %>%
#   pluck(".workflow", 1) %>%
#   pull_workflow_fit() %>%
#   vip(num_features = 20) +
#   theme_lucid() +
#   ggtitle("Variable importance, model: CatBoost with spatial clustering cv-folds")

# Error: Model-specific variable importance scores are currently not available for this type of model.

#-------------------------------------------------------
# LightGMB
# vip_lightgmb_spatcv <- last_lightgmb_fit_spatcv_fold %>%
#   pluck(".workflow", 1) %>%
#   pull_workflow_fit() %>%
#   vip(num_features = 20) +
#   theme_lucid() +
#   ggtitle("Variable importance, model: LightGMB with spatial clustering cv-folds")

# Error: Model-specific variable importance scores are currently not available for this type of model.

##############################################################################################################
# Normal (random) k-fold cross validation (CV-fold)

# Loading the last fit models 
#-------------------------------------------------------
last_lm_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_lm_fit_cv_fold.RDS"))
last_glm_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_glm_fit_cv_fold.RDS"))
last_spline_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_spline_fit_cv_fold.RDS"))
last_knn_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_knn_fit_cv_fold.RDS"))
last_pca_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_pca_fit_cv_fold.RDS"))
last_svm_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_svm_fit_cv_fold.RDS"))
last_rf_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_rf_fit_cv_fold.RDS"))
last_xgb_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_xgb_fit_cv_fold.RDS"))
last_catboost_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_catboost_fit_cv_fold.RDS"))
last_lightgmb_fit_cv_fold <- readRDS(file = here::here("TidyMod_OUTPUT","last_lightgmb_fit_cv_fold.RDS"))

#-------------------------------------------------------
# lm
vip_lm_cv <- last_lm_fit_cv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: lm with normal cv-folds")

#-------------------------------------------------------
# glm
vip_glm_cv <- last_glm_fit_cv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: glm with normal cv-folds")

#-------------------------------------------------------
# spline
vip_spline_cv <- last_spline_fit_cv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: spline with normal cv-folds")

#-------------------------------------------------------
# knn
# vip_knn_cv <- last_knn_fit_cv_fold %>%
#   pluck(".workflow", 1) %>%
#   pull_workflow_fit() %>%
#   vip(num_features = 20) +
#   theme_lucid() +
#   ggtitle("Variable importance, model: knn with normal cv-folds")

# Error: Model-specific variable importance scores are currently not available for this type of model.

#-------------------------------------------------------
# pca
vip_pca_cv <- last_pca_fit_cv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: pca with normal cv-folds")

#-------------------------------------------------------
# svm
# vip_svm_cv <- last_svm_fit_cv_fold %>% 
#   pluck(".workflow", 1) %>%   
#   pull_workflow_fit() %>% 
#   vip(num_features = 20) +
#   theme_lucid() +
#   ggtitle("Variable importance, model: svm with normal cv-folds")

# Error: Model-specific variable importance scores are currently not available for this type of model.

#-------------------------------------------------------
# Random forest
vip_rf_cv <- last_rf_fit_cv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: RF with normal cv-folds")

#-------------------------------------------------------
# XGBoost 
vip_xgboost_cv <- last_xgb_fit_cv_fold %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20) +
  theme_lucid() +
  ggtitle("Variable importance, model: XGB with normal cv-folds")

#-------------------------------------------------------
# CatBoost
# vip_catboost_cv <- last_catboost_fit_cv_fold %>% 
#   pluck(".workflow", 1) %>%   
#   pull_workflow_fit() %>% 
#   vip(num_features = 20) +
#   theme_lucid() +
#   ggtitle("Variable importance, model: CatBoost with normal cv-folds")

# Error: Model-specific variable importance scores are currently not available for this type of model.

#-------------------------------------------------------
# LightGMB
# vip_lightgmb_cv <- last_lightgmb_fit_cv_fold %>%
#   pluck(".workflow", 1) %>%
#   pull_workflow_fit() %>%
#   vip(num_features = 20) +
#   theme_lucid() +
#   ggtitle("Variable importance, model: LightGMB with normal cv-folds")

# Error: Model-specific variable importance scores are currently not available for this type of model.

#-------------------------------------------------------
# SAVING VIP PLOTS
#-------------------------------------------------------

Saving variable importance plots

##############################################################################################################
# Spatial k-fold cross validation
saveRDS(vip_lm_spatcv, here::here("TidyMod_OUTPUT","vip_lm_spatcv.RDS"))
saveRDS(vip_glm_spatcv, here::here("TidyMod_OUTPUT","vip_glm_spatcv.RDS"))
saveRDS(vip_spline_spatcv, here::here("TidyMod_OUTPUT","vip_spline_spatcv.RDS"))
saveRDS(vip_pca_spatcv, here::here("TidyMod_OUTPUT","vip_pca_spatcv.RDS"))
saveRDS(vip_rf_spatcv, here::here("TidyMod_OUTPUT","vip_rf_spatcv.RDS"))
saveRDS(vip_xgboost_spatcv, here::here("TidyMod_OUTPUT","vip_xgboost_spatcv.RDS"))

##############################################################################################################
# Normal (random) k-fold cross validation (CV-fold)
saveRDS(vip_lm_cv, here::here("TidyMod_OUTPUT","vip_lm_cv.RDS"))
saveRDS(vip_glm_cv, here::here("TidyMod_OUTPUT","vip_glm_cv.RDS"))
saveRDS(vip_spline_cv, here::here("TidyMod_OUTPUT","vip_spline_cv.RDS"))
saveRDS(vip_pca_cv, here::here("TidyMod_OUTPUT","vip_pca_cv.RDS"))
saveRDS(vip_rf_cv, here::here("TidyMod_OUTPUT","vip_rf_cv.RDS"))
saveRDS(vip_xgboost_cv, here::here("TidyMod_OUTPUT","vip_xgboost_cv.RDS"))

Visualizing variable importance plots

VIP for models with spatial clustering cv-folds

Show code
vip_lm_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_lm_spatcv.RDS"))
vip_lm_spatcv
Variable importance plot, model: lm, resampling: spatial clustering cv-folds

Figure 24: Variable importance plot, model: lm, resampling: spatial clustering cv-folds

Show code
vip_glm_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_glm_spatcv.RDS"))
vip_glm_spatcv
Variable importance plot, model: glm, resampling: spatial clustering cv-folds

Figure 25: Variable importance plot, model: glm, resampling: spatial clustering cv-folds

Show code
vip_spline_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_spline_spatcv.RDS"))
vip_spline_spatcv
Variable importance plot, model: Spline, resampling: spatial clustering cv-folds

Figure 26: Variable importance plot, model: Spline, resampling: spatial clustering cv-folds

Show code
vip_rf_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_rf_spatcv.RDS"))
vip_rf_spatcv
Variable importance plot, model: Random forest, resampling: spatial clustering cv-folds

Figure 27: Variable importance plot, model: Random forest, resampling: spatial clustering cv-folds

Show code
vip_xgboost_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_xgboost_spatcv.RDS"))
vip_xgboost_spatcv
Variable importance plot, model: XGBoost, resampling: spatial clustering cv-folds

Figure 28: Variable importance plot, model: XGBoost, resampling: spatial clustering cv-folds

VIP for models with normal cv-folds

Show code
vip_lm_cv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_lm_cv.RDS"))
vip_lm_cv
Variable importance plot, model: lm, resampling: Normal cv-folds

Figure 29: Variable importance plot, model: lm, resampling: Normal cv-folds

Show code
vip_glm_cv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_glm_cv.RDS"))
vip_glm_cv
Variable importance plot, model: glm, resampling: Normal cv-folds

Figure 30: Variable importance plot, model: glm, resampling: Normal cv-folds

Show code
vip_spline_cv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_spline_cv.RDS"))
vip_spline_cv
Variable importance plot, model: Spline, resampling: Normal cv-folds

Figure 31: Variable importance plot, model: Spline, resampling: Normal cv-folds

Show code
vip_rf_cv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_rf_cv.RDS"))
vip_rf_cv
Variable importance plot, model: Random forest, resampling: Normal cv-folds

Figure 32: Variable importance plot, model: Random forest, resampling: Normal cv-folds

Show code
vip_xgboost_cv <- readRDS(file = here::here("TidyMod_OUTPUT","vip_xgboost_cv.RDS"))
vip_xgboost_cv
Variable importance plot, model: XGBoost, resampling: Normal cv-folds

Figure 33: Variable importance plot, model: XGBoost, resampling: Normal cv-folds

Yardstick model evaluation

The yardstck package is part of the tidymodels meta-package and is made specifically to estimate how well models are working using tidy data principles. We are going to use the yardstick package functions to

##############################################################################################################
# Spatial k-fold cross validation
##############################################################################################################

# linear model (lm) --##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##-- 

# Prep the recipe in order to make the recipe ready to bake on the training and test data
lm_recipe_prepped <- prep(lm_recipe)


# Evaluate model on the training data ----------------------------------------------------------------------
lm_train_baked <- bake(lm_recipe_prepped,  new_data = training(af.split))

lm_fit_spatcv_train_pred <- lm_final_model_spatcv %>%      # fit the normal cv_folds model on all the training data
  fit(formula = logRR ~ . ,data = lm_train_baked) %>%      # predict logRR for the training data
  predict(new_data = lm_train_baked) %>%
  bind_cols(training(af.split)) 
# Evaluating model by measuring the accuracy of the model on training using `yardstick`
lm_spatcv_score_train <- lm_fit_spatcv_train_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Evaluate model on the testing data  -----------------------------------------------------------------------
lm_test_baked <- bake(lm_recipe_prepped,  new_data = testing(af.split))

lm_fit_spatcv_test_pred <- lm_final_model_spatcv %>%                      # fit the normal cv_folds model on all the test data
  fit(formula = logRR ~ ., data    = lm_test_baked) %>%       # use the training model fit to predict the test data
  predict(new_data = lm_test_baked) %>%
  bind_cols(testing(af.split))
# Evaluating model by measuring the accuracy of the model on testing data using `yardstick`
lm_spatcv_score_test <- lm_fit_spatcv_test_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Visualising residuals
lm_spatcv_pred_resid <- 
  lm_fit_spatcv_test_pred %>%
  arrange(.pred) %>%
  mutate(residual_pct = (logRR - .pred) / .pred) %>%
  select(.pred, residual_pct) %>%
  ggplot(aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted logRR") +
  ylab("Residual (%)") 

# -----------------------------------------------
  
lm_spatcv_test_pred <- 
predict(lm_final_model_spatcv %>% 
          fit(formula = logRR ~ ., 
              data = lm_test_baked), 
        new_data = lm_test_baked) %>%
  bind_cols(af.test %>% dplyr::select(logRR))

saveRDS(lm_spatcv_test_pred, file = here::here("TidyMod_OUTPUT", "lm_spatcv_test_pred.RDS"))
# lm_spatcv_test_pred %>% rmse(logRR, .pred)

# Visualising predicted vs. observed logRR plot

lm_spatcv_test_pred_hexplot <- 
  lm_spatcv_test_pred %>% 
  ggplot(aes(x = logRR, y = .pred)) +
  geom_abline(col = "black", lty = 1, size = 1) +
  geom_hex(alpha = 0.8, bins = 50) +
  geom_jitter(size = 0.8, width = 0.05, height = 0.05, shape = 3, alpha = 0.6) +
  coord_fixed(ratio = 0.8) +
  xlab("Observed logRR") +
  ylab("Predicted logRR") +
  scale_fill_viridis() + 
  ggtitle("Pred vs. obs logRR, model: lm with spatial clustering cv-fold") +
  theme_lucid(plot.title.size = 10)

saveRDS(lm_spatcv_test_pred_hexplot, file = here::here("TidyMod_OUTPUT", "lm_spatcv_test_pred_hexplot.RDS"))

# Spline --##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##-- 

# Prep the recipe in order to make the recipe ready to bake on the training and test data
spline_recipe_prepped <- prep(spline_recipe)

# Evaluate model on the training data ----------------------------------------------------------------------
spline_train_baked <- bake(spline_recipe_prepped,  new_data = training(af.split))


spline_fit_spatcv_train_pred <- spline_final_model_spatcv %>%      # fit the normal cv_folds model on all the training data
  fit(formula = logRR ~ . ,data = spline_train_baked) %>%      # predict logRR for the training data
  predict(new_data = spline_train_baked) %>%
  bind_cols(training(af.split)) 
# Evaluating model by measuring the accuracy of the model on training using `yardstick`
spline_spatcv_score_train <- spline_fit_spatcv_train_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Evaluate model on the testing data  -----------------------------------------------------------------------
spline_test_baked <- bake(spline_recipe_prepped,  new_data = testing(af.split))

spline_fit_spatcv_test_pred <- spline_final_model_spatcv %>%      # fit the normal cv_folds model on all the test data
  fit(formula = logRR ~ ., data    = spline_test_baked) %>%       # use the training model fit to predict the test data
  predict(new_data = spline_test_baked) %>%
  bind_cols(testing(af.split))

# Evaluating model by measuring the accuracy of the model on testing data using `yardstick`
spline_spatcv_score_test <- spline_fit_spatcv_test_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Visualising residuals
spline_spatcv_pred_resid <- 
  spline_fit_spatcv_test_pred %>%
  arrange(.pred) %>%
  mutate(residual_pct = (logRR - .pred) / .pred) %>%
  select(.pred, residual_pct) %>%
  ggplot(aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted logRR") +
  ylab("Residual (%)") 

# -----------------------------------------------
  
spline_spatcv_test_pred <- 
predict(spline_final_model_spatcv %>% 
          fit(formula = logRR ~ ., 
              data = spline_test_baked), 
        new_data = spline_test_baked) %>%
  bind_cols(af.test %>% dplyr::select(logRR))

saveRDS(spline_spatcv_test_pred, file = here::here("TidyMod_OUTPUT", "spline_spatcv_test_pred.RDS"))
# spline_spatcv_test_pred %>% rmse(logRR, .pred)

spline_spatcv_test_pred_hexplot <- 
  spline_spatcv_test_pred %>% 
  ggplot(aes(x = logRR, y = .pred)) +
  geom_abline(col = "black", lty = 1, size = 1) +
  geom_hex(alpha = 0.8, bins = 50) +
  geom_jitter(size = 0.8, width = 0.05, height = 0.05, shape = 3, alpha = 0.6) +
  coord_fixed(ratio = 0.8) +
  xlab("Observed logRR") +
  ylab("Predicted logRR") +
  scale_fill_viridis() + 
  ggtitle("Pred vs. obs logRR, model: Spline with spatial clustering cv-fold") +
  theme_lucid(plot.title.size = 10)

saveRDS(spline_spatcv_test_pred_hexplot, file = here::here("TidyMod_OUTPUT", "spline_spatcv_test_pred_hexplot.RDS"))

# Random forest (RF) --##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##-- 

# Prep the recipe in order to make the recipe ready to bake on the training and test data
rf_xgboost_recipe_prepped <- prep(rf_xgboost_recipe)

# Evaluate model on the training data ----------------------------------------------------------------------
rf_xgboost_train_baked <- bake(rf_xgboost_recipe_prepped,  new_data = training(af.split))


rf_fit_spatcv_train_pred <- rf_final_model_spatcv %>%      # fit the normal cv_folds model on all the training data
  fit(formula = logRR ~ . ,data = rf_xgboost_train_baked) %>%      # predict logRR for the training data
  predict(new_data = rf_xgboost_train_baked) %>%
  bind_cols(training(af.split)) 

# Evaluating model by measuring the accuracy of the model on training using `yardstick`
rf_spatcv_score_train <- rf_fit_spatcv_train_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Evaluate model on the testing data  -----------------------------------------------------------------------
rf_xgboost_test_baked <- bake(rf_xgboost_recipe_prepped,  new_data = testing(af.split))

rf_fit_spatcv_test_pred <- rf_final_model_spatcv %>%                      # fit the normal cv_folds model on all the test data
  fit(formula = logRR ~ ., data    = rf_xgboost_test_baked) %>%       # use the training model fit to predict the test data
  predict(new_data = rf_xgboost_test_baked) %>%
  bind_cols(testing(af.split))

# Evaluating model by measuring the accuracy of the model on testing data using `yardstick`
rf_spatcv_score_test <- rf_fit_spatcv_test_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Visualising residuals
rf_spatcv_pred_resid <- 
  rf_fit_spatcv_test_pred %>%
  arrange(.pred) %>%
  mutate(residual_pct = (logRR - .pred) / .pred) %>%
  select(.pred, residual_pct) %>%
  ggplot(aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted logRR") +
  ylab("Residual (%)") 

# -----------------------------------------------
  
rf_spatcv_test_pred <- 
  predict(rf_final_model_spatcv %>% 
  fit(formula = logRR ~ ., data = rf_xgboost_test_baked), 
        new_data = rf_xgboost_test_baked) %>%
  bind_cols(af.test %>% dplyr::select(logRR))


# rf_spatcv_test_pred %>% rmse(logRR, .pred)
saveRDS(rf_spatcv_test_pred, file = here::here("TidyMod_OUTPUT", "rf_spatcv_test_pred.RDS"))

rf_spatcv_test_pred_hexplot <- 
  rf_spatcv_test_pred %>% 
  ggplot(aes(x = logRR, y = .pred)) +
  geom_abline(col = "black", lty = 1, size = 1) +
  geom_hex(alpha = 0.8, bins = 50) +
  geom_jitter(size = 0.8, width = 0.05, height = 0.05, shape = 3, alpha = 0.6) +
  coord_fixed(ratio = 0.8) +
  xlab("Observed logRR") +
  ylab("Predicted logRR") +
  scale_fill_viridis() + 
  ggtitle("Pred vs. obs logRR, model: RF with spatial clustering cv-fold") +
  theme_lucid(plot.title.size = 10)

saveRDS(rf_spatcv_test_pred_hexplot, file = here::here("TidyMod_OUTPUT", "rf_spatcv_test_pred_hexplot.RDS"))

# XGBoost (xgb) --##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##-- 


xgb_fit_spatcv_train_pred <- xgb_final_model_spatcv %>%      # fit the normal cv_folds model on all the training data
  fit(formula = logRR ~ . ,data = rf_xgboost_train_baked) %>%      # predict logRR for the training data
  predict(new_data = rf_xgboost_train_baked) %>%
  bind_cols(training(af.split)) 

# Evaluating model by measuring the accuracy of the model on training using `yardstick`
xgb_spatcv_score_train <- xgb_fit_spatcv_train_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Evaluate model on the testing data  -----------------------------------------------------------------------

xgb_fit_spatcv_test_pred <- xgb_final_model_spatcv %>%                      # fit the normal cv_folds model on all the test data
  fit(formula = logRR ~ ., data    = rf_xgboost_test_baked) %>%       # use the training model fit to predict the test data
  predict(new_data = rf_xgboost_test_baked) %>%
  bind_cols(testing(af.split))

# Evaluating model by measuring the accuracy of the model on testing data using `yardstick`
xgb_spatcv_score_test <- xgb_fit_spatcv_test_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Visualising residuals
xgb_spatcv_pred_resid <- 
  xgb_fit_spatcv_test_pred %>%
  arrange(.pred) %>%
  mutate(residual_pct = (logRR - .pred) / .pred) %>%
  select(.pred, residual_pct) %>%
  ggplot(aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted logRR") +
  ylab("Residual (%)") 

# -----------------------------------------------
  
xgb_spatcv_test_pred <- 
  predict(xgb_final_model_spatcv %>% 
  fit(formula = logRR ~ ., data = rf_xgboost_test_baked), 
        new_data = rf_xgboost_test_baked) %>%
  bind_cols(af.test %>% dplyr::select(logRR))


# xgb_spatcv_test_pred %>% rmse(logRR, .pred)
saveRDS(xgb_spatcv_test_pred, file = here::here("TidyMod_OUTPUT", "xgb_spatcv_test_pred.RDS"))

xgb_spatcv_test_pred_hexplot <- 
  xgb_spatcv_test_pred %>% 
  ggplot(aes(x = logRR, y = .pred)) +
  geom_abline(col = "black", lty = 1, size = 1) +
  geom_hex(alpha = 0.8, bins = 50) +
  geom_jitter(size = 0.8, width = 0.05, height = 0.05, shape = 3, alpha = 0.6) +
  coord_fixed(ratio = 10) + # <- NOTICE the change in plot coord because something wiered is happening with xgb spatcv
  xlab("Observed logRR") +
  ylab("Predicted logRR") +
  scale_fill_viridis() + 
  ggtitle("Pred vs. obs logRR, model: xgb with spatial clustering cv-fold") +
  theme_lucid(plot.title.size = 10)

saveRDS(xgb_spatcv_test_pred_hexplot, file = here::here("TidyMod_OUTPUT", "xgb_spatcv_test_pred_hexplot.RDS"))




##############################################################################################################
# Normal (random) k-fold cross validation (CV-fold)
##############################################################################################################

# linear model (lm) --##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##-- 

# Prep the recipe in order to make the recipe ready to bake on the training and test data
lm_recipe_prepped <- prep(lm_recipe)


# Evaluate model on the training data ----------------------------------------------------------------------
lm_train_baked <- bake(lm_recipe_prepped,  new_data = training(af.split))

lm_fit_cv_train_pred <- lm_final_model_cv %>%      # fit the normal cv_folds model on all the training data
  fit(formula = logRR ~ . ,data = lm_train_baked) %>%      # predict logRR for the training data
  predict(new_data = lm_train_baked) %>%
  bind_cols(training(af.split)) 

# Evaluating model by measuring the accuracy of the model on training using `yardstick`
lm_cv_score_train <- lm_fit_cv_train_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Evaluate model on the testing data  -----------------------------------------------------------------------
lm_test_baked <- bake(lm_recipe_prepped,  new_data = testing(af.split))

lm_fit_cv_test_pred <- lm_final_model_cv %>%                      # fit the normal cv_folds model on all the test data
  fit(formula = logRR ~ ., data    = lm_test_baked) %>%       # use the training model fit to predict the test data
  predict(new_data = lm_test_baked) %>%
  bind_cols(testing(af.split))

# Evaluating model by measuring the accuracy of the model on testing data using `yardstick`
lm_cv_score_test <- lm_fit_cv_test_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Visualising residuals
lm_cv_pred_resid <- 
  lm_fit_cv_test_pred %>%
  arrange(.pred) %>%
  mutate(residual_pct = (logRR - .pred) / .pred) %>%
  select(.pred, residual_pct) %>%
  ggplot(aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted logRR") +
  ylab("Residual (%)") 

# -----------------------------------------------
  
lm_cv_test_pred <- 
predict(lm_final_model_cv %>% 
          fit(formula = logRR ~ ., 
              data = lm_test_baked), 
        new_data = lm_test_baked) %>%
  bind_cols(af.test %>% dplyr::select(logRR))


# lm_cv_test_pred %>% rmse(logRR, .pred)
saveRDS(lm_cv_test_pred, file = here::here("TidyMod_OUTPUT", "lm_cv_test_pred.RDS"))

lm_cv_test_pred_hexplot <- 
  lm_cv_test_pred %>% 
  ggplot(aes(x = logRR, y = .pred)) +
  geom_abline(col = "black", lty = 1, size = 1) +
  geom_hex(alpha = 0.8, bins = 50) +
  geom_jitter(size = 0.8, width = 0.05, height = 0.05, shape = 3, alpha = 0.6) +
  coord_fixed(ratio = 0.8) +
  xlab("Observed logRR") +
  ylab("Predicted logRR") +
  scale_fill_viridis() + 
  ggtitle("Pred vs. obs logRR, model: lm with normal cv-fold") +
  theme_lucid(plot.title.size = 10)


saveRDS(lm_cv_test_pred_hexplot, file = here::here("TidyMod_OUTPUT", "lm_cv_test_pred_hexplot.RDS"))


# Spline --##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##-- 

# Prep the recipe in order to make the recipe ready to bake on the training and test data
spline_recipe_prepped <- prep(spline_recipe)

# Evaluate model on the training data ----------------------------------------------------------------------
spline_train_baked <- bake(spline_recipe_prepped,  new_data = training(af.split))


spline_fit_cv_train_pred <- spline_final_model_cv %>%      # fit the normal cv_folds model on all the training data
  fit(formula = logRR ~ . ,data = spline_train_baked) %>%      # predict logRR for the training data
  predict(new_data = spline_train_baked) %>%
  bind_cols(training(af.split)) 

# Evaluating model by measuring the accuracy of the model on training using `yardstick`
spline_cv_score_train <- spline_fit_cv_train_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Evaluate model on the testing data  -----------------------------------------------------------------------
spline_test_baked <- bake(spline_recipe_prepped,  new_data = testing(af.split))

spline_fit_cv_test_pred <- spline_final_model_cv %>%      # fit the normal cv_folds model on all the test data
  fit(formula = logRR ~ ., data    = spline_test_baked) %>%       # use the training model fit to predict the test data
  predict(new_data = spline_test_baked) %>%
  bind_cols(testing(af.split))

# Evaluating model by measuring the accuracy of the model on testing data using `yardstick`
spline_cv_score_test <- spline_fit_cv_test_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Visualising residuals
spline_cv_pred_resid <- 
  spline_fit_cv_test_pred %>%
  arrange(.pred) %>%
  mutate(residual_pct = (logRR - .pred) / .pred) %>%
  select(.pred, residual_pct) %>%
  ggplot(aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted logRR") +
  ylab("Residual (%)") 

# -----------------------------------------------
  
spline_cv_test_pred <- 
predict(spline_final_model_cv %>% 
          fit(formula = logRR ~ ., 
              data = spline_test_baked), 
        new_data = spline_test_baked) %>%
  bind_cols(af.test %>% dplyr::select(logRR))


# spline_cv_test_pred %>% rmse(logRR, .pred)
saveRDS(spline_cv_test_pred, file = here::here("TidyMod_OUTPUT", "spline_cv_test_pred.RDS"))

spline_cv_test_pred_hexplot <- 
  spline_cv_test_pred %>% 
  ggplot(aes(x = logRR, y = .pred)) +
  geom_abline(col = "black", lty = 1, size = 1) +
  geom_hex(alpha = 0.8, bins = 50) +
  geom_jitter(size = 0.8, width = 0.05, height = 0.05, shape = 3, alpha = 0.6) +
  coord_fixed(ratio = 0.8) +
  xlab("Observed logRR") +
  ylab("Predicted logRR") +
  scale_fill_viridis() + 
  ggtitle("Pred vs. obs logRR, model: Spline with normal cv-fold") +
  theme_lucid(plot.title.size = 10)

saveRDS(spline_cv_test_pred_hexplot, file = here::here("TidyMod_OUTPUT", "spline_cv_test_pred_hexplot.RDS"))


# Random forest (RF) --##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##-- 

# Prep the recipe in order to make the recipe ready to bake on the training and test data
rf_xgboost_recipe_prepped <- prep(rf_xgboost_recipe)

# Evaluate model on the training data ----------------------------------------------------------------------
rf_xgboost_train_baked <- bake(rf_xgboost_recipe_prepped,  new_data = training(af.split))


rf_fit_cv_train_pred <- rf_final_model_cv %>%      # fit the normal cv_folds model on all the training data
  fit(formula = logRR ~ . ,data = rf_xgboost_train_baked) %>%      # predict logRR for the training data
  predict(new_data = rf_xgboost_train_baked) %>%
  bind_cols(training(af.split)) 

# Evaluating model by measuring the accuracy of the model on training using `yardstick`
rf_cv_score_train <- rf_fit_cv_train_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Evaluate model on the testing data  -----------------------------------------------------------------------
rf_xgboost_test_baked <- bake(rf_xgboost_recipe_prepped,  new_data = testing(af.split))

rf_fit_cv_test_pred <- rf_final_model_cv %>%                      # fit the normal cv_folds model on all the test data
  fit(formula = logRR ~ ., data    = rf_xgboost_test_baked) %>%       # use the training model fit to predict the test data
  predict(new_data = rf_xgboost_test_baked) %>%
  bind_cols(testing(af.split))

# Evaluating model by measuring the accuracy of the model on testing data using `yardstick`
rf_cv_score_test <- rf_fit_cv_test_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Visualising residuals
rf_cv_pred_resid <- 
  rf_fit_cv_test_pred %>%
  arrange(.pred) %>%
  mutate(residual_pct = (logRR - .pred) / .pred) %>%
  select(.pred, residual_pct) %>%
  ggplot(aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted logRR") +
  ylab("Residual (%)") 

# -----------------------------------------------
  
rf_cv_test_pred <- 
  predict(rf_final_model_cv %>% 
  fit(formula = logRR ~ ., data = rf_xgboost_test_baked), 
        new_data = rf_xgboost_test_baked) %>%
  bind_cols(af.test %>% dplyr::select(logRR))


# rf_cv_test_pred %>% rmse(logRR, .pred)
saveRDS(rf_cv_test_pred, file = here::here("TidyMod_OUTPUT", "rf_cv_test_pred.RDS"))

rf_cv_test_pred_hexplot <- 
  rf_cv_test_pred %>% 
  ggplot(aes(x = logRR, y = .pred)) +
  geom_abline(col = "black", lty = 1, size = 1) +
  geom_hex(alpha = 0.8, bins = 50) +
  geom_jitter(size = 0.8, width = 0.05, height = 0.05, shape = 3, alpha = 0.6) +
  coord_fixed(ratio = 0.8) +
  xlab("Observed logRR") +
  ylab("Predicted logRR") +
  scale_fill_viridis() + 
  ggtitle("Pred vs. obs logRR, model: RF with normal cv-fold") +
  theme_lucid(plot.title.size = 10)

saveRDS(rf_cv_test_pred_hexplot, file = here::here("TidyMod_OUTPUT", "rf_cv_test_pred_hexplot.RDS"))


# XGBoost (xgb) --##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##--##-- 


xgb_fit_cv_train_pred <- xgb_final_model_cv %>%      # fit the normal cv_folds model on all the training data
  fit(formula = logRR ~ . ,data = rf_xgboost_train_baked) %>%      # predict logRR for the training data
  predict(new_data = rf_xgboost_train_baked) %>%
  bind_cols(training(af.split)) 

# Evaluating model by measuring the accuracy of the model on training using `yardstick`
xgb_cv_score_train <- xgb_fit_cv_train_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Evaluate model on the testing data  -----------------------------------------------------------------------

xgb_fit_cv_test_pred <- xgb_final_model_cv %>%                      # fit the normal cv_folds model on all the test data
  fit(formula = logRR ~ ., data    = rf_xgboost_test_baked) %>%       # use the training model fit to predict the test data
  predict(new_data = rf_xgboost_test_baked) %>%
  bind_cols(testing(af.split))

# Evaluating model by measuring the accuracy of the model on testing data using `yardstick`
xgb_cv_score_test <- xgb_fit_cv_test_pred %>%
  yardstick::metrics(logRR, .pred) %>%
  mutate(.estimate = format(round(.estimate, 2), big.mark = ","))

# Visualising residuals
xgb_cv_pred_resid <- 
  xgb_fit_cv_test_pred %>%
  arrange(.pred) %>%
  mutate(residual_pct = (logRR - .pred) / .pred) %>%
  select(.pred, residual_pct) %>%
  ggplot(aes(x = .pred, y = residual_pct)) +
  geom_point() +
  xlab("Predicted logRR") +
  ylab("Residual (%)") 

# -----------------------------------------------
  
xgb_cv_test_pred <- 
  predict(xgb_final_model_cv %>% 
  fit(formula = logRR ~ ., data = rf_xgboost_test_baked), 
        new_data = rf_xgboost_test_baked) %>%
  bind_cols(af.test %>% dplyr::select(logRR))


# xgb_cv_test_pred %>% rmse(logRR, .pred)
saveRDS(xgb_cv_test_pred, file = here::here("TidyMod_OUTPUT", "xgb_cv_test_pred.RDS"))

xgb_cv_test_pred_hexplot <- 
  xgb_cv_test_pred %>% 
  ggplot(aes(x = logRR, y = .pred)) +
  geom_abline(col = "black", lty = 1, size = 1) +
  geom_hex(alpha = 0.8, bins = 50) +
  geom_jitter(size = 0.8, width = 0.05, height = 0.05, shape = 3, alpha = 0.6) +
  coord_fixed(ratio = 0.8) +
  xlab("Observed logRR") +
  ylab("Predicted logRR") +
  scale_fill_viridis() + 
  ggtitle("Pred vs. obs logRR, model: xgb with spatial clustering cv-fold") +
  theme_lucid(plot.title.size = 10)

saveRDS(xgb_cv_test_pred_hexplot, file = here::here("TidyMod_OUTPUT", "xgb_cv_test_pred_hexplot.RDS"))

# -----

# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Adding missing grouping variables: `ERA_Agroforestry`
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Adding missing grouping variables: `ERA_Agroforestry`
# Adding missing grouping variables: `ERA_Agroforestry`
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Adding missing grouping variables: `ERA_Agroforestry`
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Warning in predict.lm(object = object$fit, newdata = new_data, type = "response") :
#   prediction from a rank-deficient fit may be misleading
# Adding missing grouping variables: `ERA_Agroforestry`
# Adding missing grouping variables: `ERA_Agroforestry`

# -----
Show code
##############################################################################################################
# Spatial clustering cv-fold 
lm_spatcv_test_pred %>% rmse(logRR, .pred)
spline_spatcv_test_pred %>% rmse(logRR, .pred)
rf_spatcv_test_pred %>% rmse(logRR, .pred)
xgb_spatcv_test_pred %>% rmse(logRR, .pred)

##############################################################################################################
# Normal (random) cv-fold
lm_cv_test_pred %>% rmse(logRR, .pred)
spline_cv_test_pred %>% rmse(logRR, .pred)
rf_cv_test_pred %>% rmse(logRR, .pred)
xgb_cv_test_pred %>% rmse(logRR, .pred)

xgb_spatcv_test_pred_hexplot
##############################################################################################################
# Loading data
lm_spatcv_test_pred <- readRDS(file = here::here("TidyMod_OUTPUT", "lm_spatcv_test_pred.RDS"))
spline_spatcv_test_pred <- readRDS(file = here::here("TidyMod_OUTPUT", "spline_spatcv_test_pred.RDS"))
rf_spatcv_test_pred <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_spatcv_test_pred.RDS"))
xgb_spatcv_test_pred <- readRDS(file = here::here("TidyMod_OUTPUT", "xgb_spatcv_test_pred.RDS"))

lm_cv_test_pred <- readRDS(file = here::here("TidyMod_OUTPUT", "lm_cv_test_pred.RDS"))
spline_cv_test_pred <- readRDS(file = here::here("TidyMod_OUTPUT", "spline_cv_test_pred.RDS"))
rf_cv_test_pred <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_cv_test_pred.RDS"))
xgb_cv_test_pred <- readRDS(file = here::here("TidyMod_OUTPUT", "xgb_cv_test_pred.RDS"))
# Metrics (yardstick metrics set)

metric_set <- yardstick::metric_set(rmse, mae, ccc, rsq, rpd, rpiq)

##############################################################################################################
# Spatial clustering cv-fold 

lm_spatcv_metric <- lm_spatcv_test_pred %>% metric_set(logRR, .pred) 
spline_spatcv_metric <- spline_spatcv_test_pred %>% metric_set(logRR, .pred) 
rf_spatcv_metric <- rf_spatcv_test_pred %>% metric_set(logRR, .pred) 
xgb_spatcv_metric <- xgb_spatcv_test_pred %>% metric_set(logRR, .pred) 

models_spatcv_metrics <- 
  lm_spatcv_metric %>%
  left_join(spline_spatcv_metric, by = c(".estimator", ".metric")) %>%
  left_join(rf_spatcv_metric, by = c(".estimator", ".metric")) %>%
  left_join(xgb_spatcv_metric, by = c(".estimator", ".metric")) %>%
  rename(lm_spatcv_est = ".estimate.x",
         spline_spatcv_est = ".estimate.y",
         rf_spatcv_est = ".estimate.x.x",
         xgb_spatcv_est = ".estimate.y.y") %>%
  dplyr::select(!.estimator)

##############################################################################################################
# Normal (random) cv-fold
lm_cv_metric <- lm_cv_test_pred %>% metric_set(logRR, .pred)
spline_cv_metrics <- spline_cv_test_pred %>% metric_set(logRR, .pred)
rf_cv_metrics <- rf_cv_test_pred %>% metric_set(logRR, .pred)
xgb_cv_metrics <- xgb_cv_test_pred %>% metric_set(logRR, .pred)

models_cv_metrics <- 
  lm_cv_metric %>%
  left_join(spline_cv_metrics, by = c(".estimator", ".metric")) %>%
  left_join(rf_cv_metrics, by = c(".estimator", ".metric")) %>%
  left_join(xgb_cv_metrics, by = c(".estimator", ".metric")) %>%
  rename(lm_cv_est = ".estimate.x",
         spline_cv_est = ".estimate.y",
         rf_cv_est = ".estimate.x.x",
         xgb_cv_est = ".estimate.y.y") %>%
  dplyr::select(!.estimator)

# -----------------------------------------------------------------------------------------------------------
models_metrics <- 
  models_spatcv_metrics %>% 
  left_join(models_cv_metrics, by = ".metric")

models_metrics 
# A tibble: 6 × 9
  .metric lm_spatcv_est spline_spatcv_est rf_spatcv_est xgb_spatcv_est
  <chr>           <dbl>             <dbl>         <dbl>          <dbl>
1 rmse            0.505             0.469         0.505         0.736 
2 mae             0.364             0.322         0.367         0.556 
3 ccc             0.698             0.750         0.681         0.0399
4 rsq             0.537             0.601         0.541         0.241 
5 rpd             1.47              1.58          1.47          1.01  
6 rpiq            1.53              1.65          1.53          1.05  
# … with 4 more variables: lm_cv_est <dbl>, spline_cv_est <dbl>,
#   rf_cv_est <dbl>, xgb_cv_est <dbl>

Evaluating all models on model metrics

Show code
##############################################################################################################
# Spatial k-fold cross validation
##############################################################################################################


# TRAINING DATA
# Metrics for all the models on the training data ---------------------- ---------------------- ----------------------

lm_spatcv_train_pred_metric_table <- 
  lm_fit_spatcv_train_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "lm_spatcv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

spline_spatcv_train_pred_metric_table <- 
  spline_fit_spatcv_train_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "spline_spatcv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

rf_spatcv_train_pred_metric_table <- 
  rf_fit_spatcv_train_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "rf_spatcv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

xgb_spatcv_train_pred_metric_table <- 
  xgb_fit_spatcv_train_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "xgb_spatcv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)


all_models_spatcv_training_data_metric_rmse <- bind_rows(lm_spatcv_train_pred_metric_table,
                                                 spline_spatcv_train_pred_metric_table,
                                                 rf_spatcv_train_pred_metric_table,
                                                 xgb_spatcv_train_pred_metric_table) %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::arrange(-desc(.estimate))

all_models_spatcv_training_data_metric_rmse

# Saving data   
saveRDS(all_models_spatcv_training_data_metric_rmse, here::here("TidyMod_OUTPUT", "all_models_spatcv_training_data_metric_rmse.RDS"))


# TESTING DATA
# Metrics for all the models on the testing data ---------------------- ---------------------- ----------------------

lm_spatcv_test_pred_metric_table <- 
  lm_fit_spatcv_test_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "lm_spatcv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

spline_spatcv_test_pred_metric_table <- 
  spline_fit_spatcv_test_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "spline_spatcv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

rf_spatcv_test_pred_metric_table <- 
  rf_fit_spatcv_test_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "rf_spatcv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

xgb_spatcv_test_pred_metric_table <- 
  xgb_fit_spatcv_test_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "xgb_spatcv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)


all_models_spatcv_testing_data_metric_rmse <- bind_rows(lm_spatcv_test_pred_metric_table,
                                                 spline_spatcv_test_pred_metric_table,
                                                 rf_spatcv_test_pred_metric_table,
                                                 xgb_spatcv_test_pred_metric_table) %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::arrange(-desc(.estimate))

all_models_spatcv_testing_data_metric_rmse

# Saving data   
saveRDS(all_models_spatcv_testing_data_metric_rmse, here::here("TidyMod_OUTPUT", "all_models_spatcv_testing_data_metric_rmse.RDS"))




##############################################################################################################
# Normal (random) k-fold cross validation (CV-fold)
##############################################################################################################

# TRAINING DATA
# Metrics for all the models on the training data ---------------------- ---------------------- ----------------------

lm_cv_train_pred_metric_table <- 
  lm_fit_cv_train_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "lm_cv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

spline_cv_train_pred_metric_table <- 
  spline_fit_cv_train_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "spline_cv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

rf_cv_train_pred_metric_table <- 
  rf_fit_cv_train_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "rf_cv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

xgb_cv_train_pred_metric_table <- 
  xgb_fit_cv_train_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "xgb_cv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)


all_models_cv_training_data_metric_rmse <- bind_rows(lm_cv_train_pred_metric_table,
                                                 spline_cv_train_pred_metric_table,
                                                 rf_cv_train_pred_metric_table,
                                                 xgb_cv_train_pred_metric_table) %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::arrange(-desc(.estimate))

all_models_cv_training_data_metric_rmse

# Saving data   
saveRDS(all_models_cv_training_data_metric_rmse, here::here("TidyMod_OUTPUT", "all_models_cv_training_data_metric_rmse.RDS"))


# TESTING DATA
# Metrics for all the models on the testing data ---------------------- ---------------------- ----------------------

lm_cv_test_pred_metric_table <- 
  lm_fit_cv_test_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "lm_cv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

spline_cv_test_pred_metric_table <- 
  spline_fit_cv_test_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "spline_cv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

rf_cv_test_pred_metric_table <- 
  rf_fit_cv_test_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "rf_cv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)

xgb_cv_test_pred_metric_table <- 
  xgb_fit_cv_test_pred %>% metric_set(logRR, .pred) %>%
  mutate(model_and_cv_type = "xgb_cv") %>%
  select(-.estimator) %>%
  relocate(model_and_cv_type, .metric, .estimate)


all_models_cv_testing_data_metric_rmse <- bind_rows(lm_cv_test_pred_metric_table,
                                                 spline_cv_test_pred_metric_table,
                                                 rf_cv_test_pred_metric_table,
                                                 xgb_cv_test_pred_metric_table) %>%
  dplyr::filter(.metric == "rmse") %>%
  dplyr::arrange(-desc(.estimate))

all_models_cv_testing_data_metric_rmse

# Saving data   
saveRDS(all_models_cv_testing_data_metric_rmse, here::here("TidyMod_OUTPUT", "all_models_cv_testing_data_metric_rmse.RDS"))

Resampling: spatial clustering cv-folds, metric: RMSE

Show code
### Training data
all_models_spatcv_training_data_metric_rmse <- readRDS(file = here::here("TidyMod_OUTPUT","all_models_spatcv_training_data_metric_rmse.RDS"))
### Testing data
all_models_spatcv_testing_data_metric_rmse <- readRDS(file = here::here("TidyMod_OUTPUT","all_models_spatcv_testing_data_metric_rmse.RDS"))

# Combine models and metric performance -------------- -------------- -------------- -------------- -------------- -------------- --------------
spatcv_model_performance_rmse <- 
  all_models_spatcv_training_data_metric_rmse %>%
  left_join(all_models_spatcv_testing_data_metric_rmse, by = c(".metric", "model_and_cv_type")) %>%
  rename(Training_est = ".estimate.x",
         Testing_est = ".estimate.y") %>%
  dplyr::select(!.metric)

rmarkdown::paged_table(spatcv_model_performance_rmse)

Resampling: normal cv-folds, metric: RMSE

Show code
### Training data
all_models_cv_training_data_metric_rmse <- readRDS(file = here::here("TidyMod_OUTPUT","all_models_cv_training_data_metric_rmse.RDS"))
### Testing data
all_models_cv_testing_data_metric_rmse <- readRDS(file = here::here("TidyMod_OUTPUT","all_models_cv_testing_data_metric_rmse.RDS"))

# Combine models and metric performance -------------- -------------- -------------- -------------- -------------- -------------- --------------
cv_model_performance_rmse <- 
  all_models_cv_training_data_metric_rmse %>%
  left_join(all_models_cv_testing_data_metric_rmse, by = c(".metric", "model_and_cv_type")) %>%
  rename(Training_est = ".estimate.x",
         Testing_est = ".estimate.y") %>%
  dplyr::select(!.metric)

rmarkdown::paged_table(cv_model_performance_rmse)

Interesting and very counter-intuitive the models all seem to perform better on the testing set compared to the training set. This is not common, as normally models will perform better on the part of the sata that they have also been trained with/on..

Predicted vs. observed logRR

Resampling: spatial clustering cv-folds

Show code
lm_spatcv_test_pred_hexplot <- readRDS(file = here::here("TidyMod_OUTPUT", "lm_spatcv_test_pred_hexplot.RDS"))
spline_spatcv_test_pred_hexplot <- readRDS(file = here::here("TidyMod_OUTPUT", "spline_spatcv_test_pred_hexplot.RDS"))
rf_spatcv_test_pred_hexplot <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_spatcv_test_pred_hexplot.RDS"))
xgb_spatcv_test_pred_hexplot <- readRDS(file = here::here("TidyMod_OUTPUT", "xgb_spatcv_test_pred_hexplot.RDS"))

all_spatcv_pred_vs_obs_logRR_plot <- 
  gridExtra::grid.arrange(lm_spatcv_test_pred_hexplot + theme(legend.position="none"), 
                          spline_spatcv_test_pred_hexplot + theme(legend.position="none"), 
                          rf_spatcv_test_pred_hexplot + theme(legend.position="none"),
                          xgb_spatcv_test_pred_hexplot + theme(legend.position="none"),
                          ncol = 2,
                          nrow = 2)
Predicted vs. observed logRR values, resampling: Spatial clustering cv-folds

Figure 34: Predicted vs. observed logRR values, resampling: Spatial clustering cv-folds

Resampling: normal cv-folds

Show code
lm_cv_test_pred_hexplot <- readRDS(file = here::here("TidyMod_OUTPUT", "lm_cv_test_pred_hexplot.RDS"))
spline_cv_test_pred_hexplot <- readRDS(file = here::here("TidyMod_OUTPUT", "spline_cv_test_pred_hexplot.RDS"))
rf_cv_test_pred_hexplot <- readRDS(file = here::here("TidyMod_OUTPUT", "rf_cv_test_pred_hexplot.RDS"))
xgb_cv_test_pred_hexplot <- readRDS(file = here::here("TidyMod_OUTPUT", "xgb_cv_test_pred_hexplot.RDS"))

all_cv_pred_vs_obs_logRR_plot <- 
  gridExtra::grid.arrange(lm_cv_test_pred_hexplot + theme(legend.position="none"), 
                          spline_cv_test_pred_hexplot + theme(legend.position="none"), 
                          rf_cv_test_pred_hexplot + theme(legend.position="none"),
                          xgb_cv_test_pred_hexplot + theme(legend.position="none"),
                          ncol = 2,
                          nrow = 2)
Predicted vs. observed logRR values, resampling: Normal cv-folds

Figure 35: Predicted vs. observed logRR values, resampling: Normal cv-folds

STEP 11: Ensemble stacked modelling

Ensemble model, resampling: Spatial clustering cv-folds

The blend_predictions function determines how member model output will ultimately be combined in the final prediction, and is how we’ll calculate our stacking coefficients. Now that we know how to combine our model output, we can fit the models that we now know we need on the full training set. Any candidate ensemble member that has a stacking coefficient of zero doesn’t need to be refitted!

stacked_model_spatcv <- stacks::stacks() %>% 
  add_candidates(lm_fit_spatcv) %>% 
  add_candidates(glm_fit_spatcv) %>%
  add_candidates(spline_fit_spatcv) %>%
  add_candidates(knn_fit_spatcv) %>% 
  add_candidates(pca_fit_spatcv) %>% 
  add_candidates(rf_param_fit_spatcv) %>% 
  add_candidates(xgboost_param_fit_spatcv) %>% 
  add_candidates(catboost_param_fit_spatcv) %>%
  add_candidates(lightgmb_param_fit_spatcv) %>%
  blend_predictions(non_negative = TRUE, 
                    penalty = 10^(-3:1), 
                    mixture = 0.3,
                    metric = multi.metric) %>% # evaluate candidate models
  fit_members() # fit non zero stacking coefficients

# Saving the stacked model ------------------ ------------------ ------------------ ------------------ ------------------
saveRDS(stacked_model_spatcv, file = here::here("TidyMod_OUTPUT", "stacked_model_spatcv.RDS"))

# ── A stacked ensemble model ─────────────────────────────────────
# 
# Out of 90 possible candidate members, the ensemble retained 2.
# Penalty: 0.01.
# Mixture: 0.3.
# 
# The 2 highest weighted members are:

Now that we’ve fitted the needed ensemble members, our model stack is ready to go! For the most part, a model stack is just a list that contains a bunch of ensemble members and instructions on how to combine their predictions.

Show code
# Loading the stacked model
stacked_model_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "stacked_model_spatcv.RDS"))

stacked_model_spatcv 
# A tibble: 2 × 3
  member                         type       weight
  <chr>                          <chr>       <dbl>
1 lightgmb_param_fit_spatcv_1_14 boost_tree 0.624 
2 xgboost_param_fit_spatcv_1_16  boost_tree 0.0190

Stacking coeffecients

These stacking coefficients determine which candidate ensemble members will become ensemble members. Candidates with non-zero stacking coefficients are then fitted on the whole training set, altogether making up a model_stack object. We see that our model have choosen only two model candidates (LightGMB and XGBoost) out of 90 possible model candidates - and both of these two are tree-based models. 90 because each model has various unique (hyper)-parameter settings/combinations, resulting in several more candidates than actual model algorithms.

Show code
stacks::autoplot(stacked_model_spatcv, type = "weights") +
  see::theme_lucid() +
  ggtitle("Predicted vs. observed for stacked model fit using spatial clustering cv-folds") +
  theme_lucid(plot.title.size = 25)
Stacking coeffecients, resampling: Spatial clustering cv-folds

Figure 36: Stacking coeffecients, resampling: Spatial clustering cv-folds

Relation/trade-off between performance and the number of member candidates

To make sure that we have the right trade-off between minimizing the number of members and optimizing performance of the stacked model, we can use the stacks::autoplot() function. Below we see that the most optimal amount of members is also related to the LASSO penalty we give the stacking method as an argument.

Show code
stacks::autoplot(stacked_model_spatcv)
Trade-off between performance and the number of member candidates, resampling: Spatial clustering cv-folds

Figure 37: Trade-off between performance and the number of member candidates, resampling: Spatial clustering cv-folds

We can visualize this more clearly using the stacks::autoplot() argument, type and choose “members.”

Show code
stacks::autoplot(stacked_model_spatcv, type = "members") +
  geom_line()
Number of member candidates, resampling: Spatial clustering cv-folds

Figure 38: Number of member candidates, resampling: Spatial clustering cv-folds

With our two members we have indeed accomplished the best performance.

Visualising predicted vs. observed for the stacked model

Let us now finally visualise the relationship between predicted and observed logRR values for the ensemble stacked model when we use the stacked model to predict on our testing set.

# Pulling the testing set of the initial data splitting
test_data <- testing(af.split)

# Getting the predictions on the testing data using the Ensemble Stacked model
mod_pred_spatcv <- predict(stacked_model_spatcv, test_data) %>%
  bind_cols(test_data)


stacked_pred_vs_obs_spatcv <- 
  ggplot(mod_pred_spatcv, aes(x = logRR, y = .pred)) +
  #geom_point(alpha = 0.2) +
  coord_obs_pred() +
  labs(x = "Observed logRR", y = "Predicted logRR") +
  #geom_abline(linetype = "dashed", col = "red", lty = 1, size = 1) +
  geom_smooth(method = "gam", linetype = "dashed", col = "black", lty = 1, size = 1) +
  geom_jitter(width = 0.2, height = 0.2, size = 0.2, shape = 10) +
  coord_fixed(ratio = 0.8) +
  geom_hex(bins = 50, alpha = 1, show.legend = FALSE) +
  scale_fill_continuous(type = "viridis") +
  ggtitle("Predicted vs. observed for stacked model fit using spatial clustering cv-folds") +
  theme_lucid(plot.title.size = 25)

# Saving the stacked plot ------------------ ------------------ ------------------ ------------------ ------------------
saveRDS(stacked_pred_vs_obs_spatcv, file = here::here("TidyMod_OUTPUT", "stacked_pred_vs_obs_spatcv.RDS"))
Show code
stacked_pred_vs_obs_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "stacked_pred_vs_obs_spatcv.RDS"))

stacked_pred_vs_obs_spatcv
Predicted vs. observed for the stacked model, resampling: Spatial clustering cv-folds

Figure 39: Predicted vs. observed for the stacked model, resampling: Spatial clustering cv-folds

Another simpler way of doing this could be using ggplot:

Show code
 agrofor_test_spatcv <- 
   af.test %>%
   bind_cols(predict(stacked_model_spatcv, .))

pred_vs_obs_simple_spatcv <- 
  ggplot(agrofor_test_spatcv) +
  aes(x = logRR, 
      y = .pred) +
  geom_point() + 
  coord_obs_pred()

# Saving the stacked plot ------------------ ------------------ ------------------ ------------------ ------------------
saveRDS(pred_vs_obs_simple_spatcv, file = here::here("TidyMod_OUTPUT", "pred_vs_obs_simple_spatcv.RDS"))
Show code
pred_vs_obs_simple_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "pred_vs_obs_simple_spatcv.RDS"))

pred_vs_obs_simple_spatcv
Simple plot for pred vs. obs for the stacked model, resampling: Spatial clustering cv-folds

Figure 40: Simple plot for pred vs. obs for the stacked model, resampling: Spatial clustering cv-folds

Looks like our predictions were pretty strong! How do the stacks predictions perform, though, as compared to the members’ predictions? We can use the type = “members” argument to generate predictions from each of the ensemble members.

Lares regression performance plot, ensemble stacked model fit with spatial clustering cv-folds

The lares package is an R package built to automate, improve, and speed everyday analysis and machine learning tasks, build specifically for visualising regression and classification model performance by easily and intuitively giving most important plots, all generated at once. Lets see how our stacked model performs using the lares package function mplot_full(). This function plots a whole dashboard with a model’s results. It will automatically detect if it’s a categorical or regression’s model by checking how many different unique values the independent variable (tag) has. This will give us multiple plots gathered into one, showing tag vs score performance results.

Show code
agfor_stacked_model_full_lares_plot_spatcv <- 
  lares::mplot_full(tag =  mod_pred_spatcv$logRR,
                  score = mod_pred_spatcv$.pred,
                  splits = 10,
                  subtitle = "Results of Regression of Ensemble Stacked Model on models fit with spatial clustering cv-folds",
                  model_name = "Ensemble Stacked Model with models fit using spatial clustering cv-folds",
                  save = TRUE,
                  file_name = "agfor_stacked_model_full_lares_plot_spatcv.png",
                  plot = FALSE) 
Show code
ggdraw() + 
  draw_image(here::here("TidyMod_OUTPUT", "agfor_stacked_model_full_lares_plot_spatcv.png")) 
Full lares plot for the ensemble stacked model with spatial clustering cv-folds

Figure 41: Full lares plot for the ensemble stacked model with spatial clustering cv-folds

Show code
agfor_stacked_model_linear_lares_plot_spatcv <- 
  lares::mplot_lineal(tag = mod_pred_spatcv$logRR, 
                    score = mod_pred_spatcv$.pred,
                    subtitle = "Regression of Ensemble Stacked Model, resampling: spatial clustering cv-folds",
                    model_name = "Regression of Ensemble Stacked Model with only linear",
                    save = TRUE,
                    file_name = "agfor_stacked_model_linreg_lares_plot_spatcv.png")
Show code
ggdraw() + 
  draw_image(here::here("TidyMod_OUTPUT", "agfor_stacked_model_linreg_lares_plot_spatcv.png")) 
Lares linear regression plot for the ensemble stacked model with spatial clustering cv-folds

Figure 42: Lares linear regression plot for the ensemble stacked model with spatial clustering cv-folds

Model performance in splits by quantiles

One of my favourite functions to study our model’s quality is the following. What it shows is the result of arranging all scores or predicted values in sorted quantiles, from worst to best, and see how the classification goes compared to our test set. The math behind this plot might be a bit foggy so lets try to explain further: if we sort all our observations from the lowest to the highest, then we can, for instance, select the top 5 or bottom 15 % or so. What we do now is split all those “ranked” rows into similar-sized-buckets to get the best bucket, the second-best one, and so on. Then, if we split all the “goods” and the “bads” into five columns, keeping their buckets’ colours, we still have it sorted and separated, right?

agfor_stacked_model_splits_lares_plot_spatcv <- 
  lares::mplot_splits(tag = mod_pred_spatcv$logRR, 
                    score = mod_pred_spatcv$.pred,
                    split = 5,
                    save = TRUE,
                    file_name = "agfor_stacked_model_splits_lares_plot_spatcv.png",
                    subtitle = "Lares split by group plot for stacked ensemble model, resampling: Spatial clustering cv-folds")
Show code
ggdraw() + 
  draw_image(here::here("TidyMod_OUTPUT", "agfor_stacked_model_splits_lares_plot_spatcv.png")) 
Lares linear regression plot for the ensemble stacked model with normal cv-folds

Figure 43: Lares linear regression plot for the ensemble stacked model with normal cv-folds

Comparing RMSE (or any other metric) of individual models with the ensemble stacked model

Now, let us evaluate the root mean squared error from each model compared to the ensemble stacked model

member_preds_spatcv <- 
  af.test %>%
  select(logRR) %>%
  bind_cols(predict(stacked_model_spatcv, af.test, members = TRUE))


member_map_preds_spatcv <- 
  map_dfr(member_preds_spatcv, rmse, truth = logRR, data = member_preds_spatcv) %>%
  mutate(member = colnames(member_preds_spatcv))

# Saving the stacked data for member_map_preds_spatcv ------------------ ------------------ ------------------ ------------------ ------------------
saveRDS(member_map_preds_spatcv, file = here::here("TidyMod_OUTPUT", "member_map_preds_spatcv.RDS"))
Show code
member_map_preds_spatcv <- readRDS(file = here::here("TidyMod_OUTPUT", "member_map_preds_spatcv.RDS"))

rmarkdown::paged_table(member_map_preds_spatcv)

Ensemble model, resampling: Normal cv-folds

Show code
stacked_model_cv <- stacks::stacks() %>% 
  add_candidates(lm_fit_cv) %>% 
  add_candidates(glm_fit_cv) %>%
  add_candidates(spline_fit_cv) %>%
  add_candidates(knn_fit_cv) %>% 
  add_candidates(pca_fit_cv) %>% 
  add_candidates(rf_param_fit_cv) %>% 
  add_candidates(xgboost_param_fit_cv) %>% 
  add_candidates(catboost_param_fit_cv) %>%
  add_candidates(lightgmb_param_fit_cv) %>%
  blend_predictions(non_negative = TRUE, 
                    penalty = 10^(-3:1), 
                    mixture = 1,
                    metric = multi.metric) %>% # evaluate candidate models
  fit_members() # fit non zero stacking coefficients

# Saving the stacked model ------------------ ------------------ ------------------ ------------------ ------------------
saveRDS(stacked_model_cv, file = here::here("TidyMod_OUTPUT", "stacked_model_cv.RDS"))

# ── A stacked ensemble model ─────────────────────────────────────
# 
# Out of 90 possible candidate members, the ensemble retained 9.
# Penalty: 0.01.
# Mixture: 1.
# 
# The 9 highest weighted members are:
Show code
stacked_model_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "stacked_model_cv.RDS"))

stacked_model_cv 
# A tibble: 9 × 3
  member                     type              weight
  <chr>                      <chr>              <dbl>
1 xgboost_param_fit_cv_1_20  boost_tree  3234.       
2 xgboost_param_fit_cv_1_17  boost_tree     6.60     
3 rf_param_fit_cv_1_03       rand_forest    0.259    
4 xgboost_param_fit_cv_1_16  boost_tree     0.220    
5 xgboost_param_fit_cv_1_15  boost_tree     0.126    
6 rf_param_fit_cv_1_19       rand_forest    0.121    
7 lightgmb_param_fit_cv_1_15 boost_tree     0.0981   
8 spline_fit_cv_1_1          linear_reg     0.00777  
9 rf_param_fit_cv_1_10       rand_forest    0.0000798
Show code
stacks::autoplot(stacked_model_cv)
Stacking coeffecients, resampling: Normal cv-folds

Figure 44: Stacking coeffecients, resampling: Normal cv-folds

Show code
stacks::autoplot(stacked_model_cv, type = "members") +
  geom_line()
Trade-off between performance and the number of member candidates, resampling: Normal cv-folds

Figure 45: Trade-off between performance and the number of member candidates, resampling: Normal cv-folds

We can visualize this more clearly using the stacks::autoplot() argument, type and choose “members.”

Show code
stacks::autoplot(stacked_model_cv, type = "weights") +
  scale_x_continuous(trans = 'log2') +
  see::theme_lucid()
Number of member candidates, resampling: Normal cv-folds

Figure 46: Number of member candidates, resampling: Normal cv-folds

We see that there are some members (model candidates) that have been included even though they are negatively contributing to the stacks. There can be several logical reasons for this.

Visualising predicted vs. observed for the stacked model

# Pulling the testing set of the initial data splitting
test_data <- testing(af.split)

# Getting the predictions on the testing data using the Ensemble Stacked model
mod_pred_cv <- predict(stacked_model_cv, test_data) %>%
  bind_cols(test_data)


stacked_pred_vs_obs_cv <- 
  ggplot(mod_pred_cv, aes(x = logRR, y = .pred)) +
  #geom_point(alpha = 0.2) +
  coord_obs_pred() +
  labs(x = "Observed logRR", y = "Predicted logRR") +
  #geom_abline(linetype = "dashed", col = "red", lty = 1, size = 1) +
  geom_smooth(method = "gam", linetype = "dashed", col = "black", lty = 1, size = 1) +
  geom_jitter(width = 0.2, height = 0.2, size = 0.2, shape = 10) +
  coord_fixed(ratio = 0.8) +
  geom_hex(bins = 50, alpha = 1, show.legend = FALSE) +
  scale_fill_continuous(type = "viridis") +
  ggtitle("Predicted vs. observed for stacked model fit using normal cv-folds") +
  theme_lucid(plot.title.size = 25)

# Saving the stacked plot ------------------ ------------------ ------------------ ------------------ ------------------
saveRDS(stacked_pred_vs_obs_cv, file = here::here("TidyMod_OUTPUT", "stacked_pred_vs_obs_cv.RDS"))
Show code
stacked_pred_vs_obs_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "stacked_pred_vs_obs_cv.RDS"))

stacked_pred_vs_obs_cv
Predicted vs. observed for the stacked model, resampling: Normal cv-folds

Figure 47: Predicted vs. observed for the stacked model, resampling: Normal cv-folds

Another simpler way of doing this could be using ggplot:

Show code
 agrofor_test_cv <- 
   af.test %>%
   bind_cols(predict(stacked_model_cv, .))

pred_vs_obs_simple_cv <- 
  ggplot(agrofor_test_cv) +
  aes(x = logRR, 
      y = .pred) +
  geom_point() + 
  coord_obs_pred()

# Saving the stacked plot ------------------ ------------------ ------------------ ------------------ ------------------
saveRDS(pred_vs_obs_simple_cv, file = here::here("TidyMod_OUTPUT", "pred_vs_obs_simple_cv.RDS"))
Show code
pred_vs_obs_simple_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "pred_vs_obs_simple_cv.RDS"))

pred_vs_obs_simple_cv
Simple plot for pred vs. obs for the stacked model, resampling: Normal cv-folds

Figure 48: Simple plot for pred vs. obs for the stacked model, resampling: Normal cv-folds

Lares regression performance plot, ensemble stacked model fit with normal cv-folds

agfor_stacked_model_full_lares_plot_cv <- 
  lares::mplot_full(tag =  mod_pred_cv$logRR,
                  score = mod_pred_cv$.pred,
                  splits = 10,
                  subtitle = "Results of Regression of Ensemble Stacked Model on models fit with normal cv-folds",
                  model_name = "Ensemble Stacked Model with models fit using normal cv-folds",
                  save = TRUE,
                  file_name = "agfor_stacked_model_full_lares_plot_cv.png",
                  plot = FALSE) 
Show code
ggdraw() + 
  draw_image(here::here("TidyMod_OUTPUT","agfor_stacked_model_full_lares_plot_cv.png")) 
Full lares plot for the ensemble stacked model with normal cv-folds

Figure 49: Full lares plot for the ensemble stacked model with normal cv-folds

Show code
agfor_stacked_model_linear_lares_plot_cv <- 
  lares::mplot_lineal(tag = mod_pred_cv$logRR, 
                    score = mod_pred_cv$.pred,
                    subtitle = "Regression of Ensemble Stacked Model, resampling: Normal cv-folds",
                    model_name = "Regression of Ensemble Stacked Model with only linear",
                    save = TRUE,
                    file_name = "agfor_stacked_model_linreg_lares_plot_cv.png")
Show code
ggdraw() + 
  draw_image(here::here("TidyMod_OUTPUT", "agfor_stacked_model_linreg_lares_plot_cv.png")) 
Full lares plot for the ensemble stacked model with normal cv-folds

Figure 50: Full lares plot for the ensemble stacked model with normal cv-folds

Lares splits for the models fit with normal cv-folds

agfor_stacked_model_split_lares_plot_cv <- 
  lares::mplot_splits(tag = mod_pred_cv$logRR, 
                    score = mod_pred_cv$.pred,
                    split = 5,
                    save = TRUE,
                    file_name = "agfor_stacked_model_splits_lares_plot_cv.png",
                    subtitle = "Lares split by group plot for stacked ensemble model, resampling: Normal cv-folds")
Show code
ggdraw() + 
  draw_image(here::here("TidyMod_OUTPUT","agfor_stacked_model_splits_lares_plot_cv.png")) 
Splits lares plot for the ensemble stacked model with normal cv-folds

Figure 51: Splits lares plot for the ensemble stacked model with normal cv-folds

Comparing RMSE (or any other metric) of individual models with the ensemble stacked model

Now, let us evaluate the root mean squared error from each model compared to the ensemble stacked model

member_preds_cv <- 
  af.test %>%
  select(logRR) %>%
  bind_cols(predict(stacked_model_cv, af.test, members = TRUE))

member_map_preds_cv <- 
  map_dfr(member_preds_cv, rmse, truth = logRR, data = member_preds_cv) %>%
  mutate(member = colnames(member_preds_cv))

# Saving the stacked data for member_map_preds_spatcv ------------------ ------------------ ------------------ ------------------ ------------------
saveRDS(member_map_preds_cv, file = here::here("TidyMod_OUTPUT", "member_map_preds_cv.RDS"))
Show code
member_map_preds_cv <- readRDS(file = here::here("TidyMod_OUTPUT", "member_map_preds_cv.RDS"))

rmarkdown::paged_table(member_map_preds_cv)

As we can see, the stacked ensemble outperforms each of the member models, though is closely followed by one of its members.

Voila! We’ve now made use of the stacks package to predict response ratioe outcomes of ERA agroforestry data observations from ERA using a stacked ensemble modelling approach!


  1. Amazon Elastic Compute Cloud is a part of Amazon.com’s cloud-computing platform, Amazon Web Services (AWS), that allows users to rent virtual computers on which to run their own computer applications.↩︎

References